Abstract

Boosting combines weak learners into a predictor with low empirical risk. Its dual constructs 
a high entropy distribution upon which weak learners and training labels are uncorrelated. This 
manuscript studies this primal-dual relationship under a broad family of losses, including the 
exponential loss of AdaBoost and the logistic loss, revealing: 

• Weak learnability aids the whole loss family: for any $\epsilon > 0$, $\mathcal{O}(\ln(1/\epsilon))$ iterations suffice
to produce a predictor with empirical risk $\epsilon$-close to the infimum;

• The circumstances granting the existence of an empirical risk minimizer may be characterized
in terms of the primal and dual problems, yielding a new proof of the known rate
$\mathcal{O}(\ln(1/\epsilon))$;

• Arbitrary instances may be decomposed into the above two, granting rate $\mathcal{O}(1/\epsilon)$, with a
matching lower bound provided for the logistic loss.

1 Introduction 

Boosting is the task of converting inaccurate weak learners into a single accurate predictor. The
existence of any such method was unknown until the breakthrough result of Schapire (1990): under
a weak learning assumption, it is possible to combine many carefully chosen weak learners into a
majority of majorities with arbitrarily low training error. Soon after, Freund (1995) noted that a
single majority is enough, and that $\mathcal{O}(\ln(1/\epsilon))$ iterations are both necessary and sufficient to attain
accuracy $\epsilon$. Finally, their combined effort produced AdaBoost, which exhibits this optimal convergence
rate (under the weak learning assumption), and has an astonishingly simple implementation
(Freund and Schapire 1997).



It was eventually revealed that AdaBoost was minimizing a risk functional, specifically the
exponential loss (Breiman 1999). Aiming to alleviate perceived deficiencies in the algorithm, other
loss functions were proposed, foremost amongst these being the logistic loss (Friedman et al. 2000).



Given the wide practical success of boosting with the logistic loss, it is perhaps surprising that no
convergence rate better than $\mathcal{O}(\exp(1/\epsilon^2))$ was known, even under the weak learning assumption
(Bickel et al. 2006). The reason for this deficiency is simple: unlike SVM, least squares, and
basically any other optimization problem considered in machine learning, there might not exist a choice
which attains the minimal risk! This reliance is carried over from convex optimization, where the
assumption of attainability is generally made, either directly, or through stronger conditions like
compact level sets or strong convexity (Luo and Tseng 1992). But this limitation seems artificial:
a function like $\exp(-x)$ has no minimizer but decays rapidly.

Convergence rate analysis provides a valuable mechanism to compare and improve minimization
algorithms. But there is a deeper significance with boosting: a convergence rate of $\mathcal{O}(\ln(1/\epsilon))$
means that, with a combination of just $\mathcal{O}(\ln(1/\epsilon))$ predictors, one can construct an $\epsilon$-optimal
classifier, which is crucial to both the computational efficiency and statistical stability of this predictor.

The main contribution of this manuscript is to provide a tight convergence theory for a large
family of losses, including the exponential and logistic losses, which has heretofore resisted
analysis. In particular, it is shown that the (disjoint) scenarios of weak learnability (Section 6.1) and
attainability (Section 6.2) both exhibit the rate $\mathcal{O}(\ln(1/\epsilon))$. These two scenarios are in a strong
sense extremal, and general instances are shown to decompose into them; but their conflicting
behavior yields a degraded rate $\mathcal{O}(1/\epsilon)$ (Section 6.3). A matching lower bound for the logistic loss
demonstrates this is no artifact.

1.1 Outline 

Beyond providing these rates, this manuscript will study the rich ecology within the primal-dual 
interplay of boosting. 

Starting with necessary background, Section 2 provides the standard view of boosting as
coordinate descent of an empirical risk. This primal formulation of boosting obscures a key internal
mechanism: boosting iteratively constructs distributions where the previously selected weak learner
fails. This view is recovered in the dual problem; specifically, Section 3 reveals that the dual feasible
set is the collection of distributions where all weak learners have no correlation to the target, and
the dual objective is a max entropy rule.

The dual optimum is always attainable; since a standard mechanism in convergence analysis is to
control the distance to the optimum, why not overcome the unattainability of the primal optimum
by working in the dual? It turns out that the classical weak learning rate was a mechanism to control
distances in the dual all along; by developing a suitable generalization (Section 4), it is possible to
convert the improvement due to a single step of coordinate descent into a relevant distance in the
dual (Section 6). Crucially, this holds for general instances, without any assumptions.

The final puzzle piece is to relate these dual distances to the optimality gap. Section 5 lays
the foundation, taking a close look at the structure of the optimization problem. The classical 
scenarios of attainability and weak learnability are identifiable directly from the weak learning class 
and training sample; moreover, they can be entirely characterized by properties of the primal and 
dual problems. 

Section 5 will also reveal another structure: there is a subset of the training set, the hard core,
which is the maximal support of any distribution upon which every weak learner and the training
labels are uncorrelated. This set is central: for instance, the dual optimum (regardless of the loss
function) places positive weight on exactly the hard core. Weak learnability corresponds to the
hard core being empty, and attainability corresponds to it being the whole training set. For those
instances where the hard core is a nonempty proper subset of the training set, the behavior on and
off the hard core mimics attainability and weak learnability, and Section 6.3 will leverage this to
produce rates using facts derived for the two constituent scenarios.

Much of the technical material is relegated to the appendices. For convenience, Appendix A
summarizes notation, and Appendix B contains some important supporting results. Of perhaps
practical interest, Appendix D provides methods to select the step size, meaning the weight with
which new weak learners are included in the full predictor. These methods are sufficiently powerful
to grant the convergence rates in this manuscript.

1.2 Related Work 

The development of general convergence rates has a number of important milestones in the past
decade. Collins et al. (2002) proved convergence for a large family of losses, albeit without any
rates. Interestingly, the step size only partially modified the choice from AdaBoost to accommodate
arbitrary losses, whereas the choice here follows standard optimization principles based purely on the
particular loss. Next, Bickel et al. (2006) showed a general rate of $\mathcal{O}(\exp(1/\epsilon^2))$ for a slightly smaller
family of functions: every loss has positive lower and upper bounds on its second derivative within
any compact interval. This is a larger family than what is considered in the present manuscript, but
Section 6.2 will discuss the role of the extra assumptions when producing fast rates.

Many extremely important cases have also been handled. The first is the original rate of
$\mathcal{O}(\ln(1/\epsilon))$ for the exponential loss under the weak learning assumption (Freund and Schapire
1997). Next, under the assumption that the empirical risk minimizer is attainable, Rätsch et al.
(2001) demonstrated the rate $\mathcal{O}(\ln(1/\epsilon))$. The loss functions in that work must satisfy lower and
upper bounds on the Hessian within the initial level set; equivalently, the existence of lower and
upper bounding quadratic functions within this level set. This assumption may be slightly relaxed
to needing just lower and upper second derivative bounds on the univariate loss function within an
initial bounding interval (cf. discussion within Section 5.2), which is the same set of assumptions
used by Bickel et al. (2006), and as discussed in Section 6.2 is all that is really needed by the analysis
in the present manuscript under attainability.



Parallel to the present work, Mukherjee et al. (2011) established general convergence under the
exponential loss, with a rate of $\mathcal{O}(1/\epsilon)$. That work also presented bounds comparing the AdaBoost
suboptimality to any bounded solution, which can be used to succinctly prove consistency
properties of AdaBoost (Schapire and Freund, in preparation). In this case, the rate degrades to $\mathcal{O}(\epsilon^{-5})$,
which although presented without lower bound, is not terribly surprising since the optimization
problem minimized by boosting has no norm penalization. Finally, mirroring the development here,
Mukherjee et al. (2011) used the same boosting instance (due to Schapire (2010)) to produce lower
bounds, and also decomposed the boosting problem into finite and infinite margin pieces (cf.
Section 5.3).



It is interesting to mention that, for many variants of boosting, general convergence rates were
known. Specifically, once it was revealed that boosting is trying to be not only correct but also
have large margins (Schapire et al. 1997), much work was invested into methods which explicitly
maximized the margin (Rätsch and Warmuth 2002), or penalized variants focused on the inseparable
case (Warmuth et al. 2007; Shalev-Shwartz and Singer 2008). These methods generally impose
some form of regularization (Shalev-Shwartz and Singer 2008), which grants attainability of the
risk minimizer, and allows standard techniques to grant general convergence rates. Interestingly, the
guarantees in those works cited in this paragraph are $\mathcal{O}(1/\epsilon^2)$.



Kivinen and Warmuth (1999) and Collins et al. (2002) demonstrated that boosting is seeking a difficult
distribution over training examples via iterated Bregman projections.

The notion of hard core sets is due to Impagliazzo (1995). A crucial difference is that in the
present work, the hard core is unique, maximal, and every weak learner does no better than random 
guessing upon a family of distributions supported on this set; in this cited work, the hard core 
is relaxed to allow some small but constant fraction correlation to the target. This relaxation is 
central to the work, which provides a correspondence between the complexity (circuit size) of the 
weak learners, the difficulty of the target function, the size of the hard core, and the correlation 
permitted in the hard core. 



2 Setup 



A view of boosting, which pervades this manuscript, is that the action of the weak learning class
upon the sample can be encoded as a matrix (Rätsch et al. 2001; Shalev-Shwartz and Singer 2008).
Let a sample $S := \{(x_i, y_i)\}_1^m \subseteq (\mathcal{X} \times \mathcal{Y})^m$ and a weak learning class $\mathcal{H}$ be given. For every $h \in \mathcal{H}$,
let $S|_h$ denote the negated projection onto $S$ induced by $h$; that is, $S|_h$ is a vector of length $m$,
with coordinates $(S|_h)_i = -y_i h(x_i)$. If the set of all such columns $\{S|_h : h \in \mathcal{H}\}$ is finite, collect
them into the matrix $A \in \mathbb{R}^{m \times n}$. Let $a_i$ denote the $i$th row of $A$, corresponding to the example
$(x_i, y_i)$, and let $\{h_j\}_1^n$ index the set of weak learners corresponding to columns of $A$. It is assumed,
for convenience, that entries of $A$ are within $[-1, +1]$; relaxing this assumption merely scales the
presented rates by a constant.

Figure 1: Exponential and logistic losses, plotted with linear and log-scale range.

The setting considered here is that this finite matrix can be constructed. Note that this can
encode infinite classes, so long as they map to only $k < \infty$ values (in which case $A$ has at most $k^m$
columns). As another example, if the weak learners are binary, and $\mathcal{H}$ has VC dimension $d$, then
Sauer's lemma grants that $A$ has at most $(m+1)^d$ columns. This matrix view of boosting is thus
similar to the interpretation of boosting performing descent in functional space (Mason et al. 2000;
Friedman et al. 2000), but the class complexity and finite sample have been used to reduce the
function class to a finite object.
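
As a minimal sketch of this matrix encoding, the following Python builds $A$ with entries $A_{ij} = -y_i h_j(x_i)$. The one-dimensional sample, the threshold stumps, and the names `build_matrix` and `stump` are hypothetical, chosen only for illustration.

```python
def build_matrix(xs, ys, learners):
    """Return A as a list of rows, with A[i][j] = -y_i * h_j(x_i)."""
    return [[-y * h(x) for h in learners] for x, y in zip(xs, ys)]


def stump(threshold):
    """Hypothetical binary weak learner: +1 when x > threshold, else -1."""
    return lambda x: 1.0 if x > threshold else -1.0


# Toy sample (an assumption for illustration): m = 4 points on the line.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [-1.0, -1.0, 1.0, 1.0]
learners = [stump(0.5), stump(1.5), stump(2.5)]  # n = 3 columns

A = build_matrix(xs, ys, learners)
```

With this sign convention, a column whose entries are all negative corresponds to a weak learner that classifies every example in the sample correctly; here that is the middle stump.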

To make the connection to boosting, the missing ingredient is the loss function. 

Definition 2.1. $\mathbb{G}_0$ is the set of loss functions $g : \mathbb{R} \to \mathbb{R}$ satisfying: $g$ is twice continuously
differentiable, $g'' > 0$, and $\lim_{x \to -\infty} g(x) = 0$.

For convenience, whenever $g \in \mathbb{G}_0$ and sample size $m$ are provided, let $f : \mathbb{R}^m \to \mathbb{R}$ denote the
empirical risk function $f(x) := \sum_{i=1}^m g((x)_i)$. For more properties of $g$ and $f$, please see Appendix C.



The convergence rates of Section 6 will require a few more conditions, but $\mathbb{G}_0$ suffices for all
earlier results.

Example 2.2. The exponential loss $\exp(\cdot)$ (AdaBoost) and logistic loss $\ln(1 + \exp(\cdot))$ are both
within $\mathbb{G}_0$ (and the eventual $\mathbb{G}$). These two losses appear in Figure 1, where the log-scale plot aims
to convey their similarity for negative values.
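
The two losses and the (unnormalized) empirical risk can be sketched in a few lines of Python. The function names are illustrative assumptions, and the logistic loss is written in a numerically stable form rather than literally as $\ln(1 + \exp(z))$.

```python
import math


def exp_loss(z):
    """Exponential loss in the increasing convention used here: exp(z)."""
    return math.exp(z)


def logistic_loss(z):
    """Logistic loss ln(1 + exp(z)), arranged to avoid overflow for large |z|."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def empirical_risk(g, x):
    """Unnormalized empirical risk f(x) = sum_i g(x_i)."""
    return sum(g(xi) for xi in x)


# For negative arguments the two losses nearly coincide.
ratio_at_minus_20 = logistic_loss(-20.0) / exp_loss(-20.0)
```

Both functions tend to zero as the argument tends to $-\infty$, and their ratio approaches one there, which is the similarity the log-scale plot conveys.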

This definition provides a notational break from most boosting literature, which instead requires
$\lim_{x \to \infty} g(x) = 0$ (i.e., the exponential loss becomes $\exp(-x)$); note that the usage here simply
pushes the negation into the definition of the matrix $A$. The significance of this modification is that
the gradient of the empirical risk, which corresponds to distributions produced by boosting, is a
nonnegative measure. (Otherwise, it would be necessary to negate this (nonpositive) distribution
everywhere to match the boosting literature.) Note that there is no consensus on this choice, and
the form followed here can be found elsewhere (Boucheron et al. 2005).

Boosting determines some weighting $\lambda \in \mathbb{R}^n$ of the columns of $A$, which correspond to weak
learners in $\mathcal{H}$. The (unnormalized) margin of example $i$ is thus $\langle -a_i, \lambda \rangle = -e_i^\top A \lambda$, where $e_i$ is an
indicator vector. (This negation is one notational inconvenience of making losses increasing.) Since
the prediction on $x_i$ is $\mathbb{1}[\sum_j \lambda_j h_j(x_i) > 0] = \mathbb{1}[y_i \langle a_i, \lambda \rangle < 0]$, it follows that $A\lambda < 0_m$ (where $0_m$ is
the zero vector) implies a training error of zero. As such, boosting solves the minimization problem

$$\inf_{\lambda \in \mathbb{R}^n} \sum_{i=1}^m g(\langle a_i, \lambda \rangle) = \inf_{\lambda \in \mathbb{R}^n} \sum_{i=1}^m g(e_i^\top A \lambda) = \inf_{\lambda \in \mathbb{R}^n} f(A\lambda) = \inf_{\lambda \in \mathbb{R}^n} (f \circ A)(\lambda) =: \bar{f}_A, \tag{2.3}$$

where, recall, $f : \mathbb{R}^m \to \mathbb{R}$ is the convenience function $f(x) = \sum_i g((x)_i)$, and in the present problem denotes
the (unnormalized) empirical risk. $\bar{f}_A$ will denote the optimal objective value.

Routine Boost.

Input: Convex function $f \circ A$.

Output: Approximate primal optimum $\lambda$.

1. Initialize $\lambda_0 := 0_n$.

2. For $t = 1, 2, \ldots$, while $\nabla (f \circ A)(\lambda_{t-1}) \neq 0_n$:

   (a) Choose column (weak learner)
       $j_t := \operatorname{argmax}_j |\nabla (f \circ A)(\lambda_{t-1})^\top e_j|$.

   (b) Correspondingly, set descent direction $v_t \in \{\pm e_{j_t}\}$; note
       $v_t^\top \nabla (f \circ A)(\lambda_{t-1}) = -\|\nabla (f \circ A)(\lambda_{t-1})\|_\infty$.

   (c) Find $\alpha_t$ via approximate solution to the line search
       $\inf_{\alpha > 0} (f \circ A)(\lambda_{t-1} + \alpha v_t)$.

   (d) Update $\lambda_t := \lambda_{t-1} + \alpha_t v_t$.

3. Return $\lambda_{t-1}$.

Figure 2: steepest descent (Boyd and Vandenberghe 2004, Algorithm 9.4) of $f \circ A$.
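
The routine in Figure 2 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the manuscript's implementation: the function name `boost`, the fixed iteration budget, and in particular the backtracking (Armijo) line search are generic stand-ins, and the step size rules of the appendix are not reproduced here.

```python
import math


def boost(A, g, g_prime, iterations=50):
    """Coordinate descent on (f o A)(lam) = sum_i g(<a_i, lam>)."""
    m, n = len(A), len(A[0])

    def risk(lam):
        return sum(g(sum(A[i][j] * lam[j] for j in range(n))) for i in range(m))

    lam = [0.0] * n                                      # step 1
    for _ in range(iterations):
        margins = [sum(A[i][j] * lam[j] for j in range(n)) for i in range(m)]
        weights = [g_prime(z) for z in margins]          # nabla f(A lam), nonnegative
        grad = [sum(A[i][j] * weights[i] for i in range(m)) for j in range(n)]
        j = max(range(n), key=lambda k: abs(grad[k]))    # step (a)
        if grad[j] == 0.0:
            break                                        # gradient is zero: stop
        sign = -1.0 if grad[j] > 0 else 1.0              # step (b): v = sign * e_j
        # Step (c): backtracking search (an assumed stand-in for the line search).
        alpha, current = 1.0, risk(lam)
        while True:
            trial = list(lam)
            trial[j] += sign * alpha
            if risk(trial) <= current - 0.5 * alpha * abs(grad[j]):
                break
            alpha *= 0.5
        lam = trial                                      # step (d)
    return lam


# Hypothetical instance whose middle column is a perfect weak learner.
A = [[-1.0, -1.0, -1.0], [1.0, -1.0, -1.0], [-1.0, -1.0, 1.0], [-1.0, -1.0, -1.0]]
lam = boost(A, math.exp, math.exp, iterations=20)
margins = [sum(a * l for a, l in zip(row, lam)) for row in A]
```

On this instance every coordinate of $A\lambda$ ends up negative, so the training error is zero, while the risk keeps shrinking with more iterations.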



The infimum in (2.3) may well not be attainable. Suppose there exists $\lambda'$ such that $A\lambda' < 0_m$
(Theorem 5.2 will show that this is equivalent to the weak learning assumption). Then

$$0 \le \inf_{\lambda \in \mathbb{R}^n} f(A\lambda) \le \inf_{c > 0} f(A(c\lambda')) = 0.$$

On the other hand, for any $\lambda \in \mathbb{R}^n$, $f(A\lambda) > 0$. Thus the infimum is never attainable when weak
learnability holds.
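
The scaling argument above can be checked numerically. In the sketch below, the matrix `A` and the direction `lam_prime` are hypothetical; the risk uses the exponential loss.

```python
import math

A = [[-1.0, 1.0], [-1.0, -1.0], [-1.0, 0.5]]   # hypothetical instance
lam_prime = [1.0, 0.0]                          # A lam' = (-1, -1, -1) < 0


def risk(lam):
    """f(A lam) for the exponential loss."""
    return sum(math.exp(sum(a * l for a, l in zip(row, lam))) for row in A)


# Scaling lam' drives the risk toward zero without ever reaching it.
values = [risk([c * l for l in lam_prime]) for c in (1.0, 10.0, 100.0)]
```

The values decrease toward zero but each stays strictly positive, so no finite $\lambda$ attains the infimum.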

The template boosting algorithm appears in Figure 2, formulated in terms of $f \circ A$ to make the
connection to coordinate descent as clear as possible. To interpret the gradient terms, note that

$$(\nabla (f \circ A)(\lambda))_j = (A^\top \nabla f(A\lambda))_j = -\sum_{i=1}^m g'(\langle a_i, \lambda \rangle) h_j(x_i) y_i,$$

which is the expected negative correlation of $h_j$ with the target labels according to an unnormalized
distribution with weights $g'(\langle a_i, \lambda \rangle)$. The stopping condition $\nabla (f \circ A)(\lambda) = 0_n$ means: either the
distribution is degenerate (it is exactly zero), or every weak learner is uncorrelated with the target.

As such, Boost in Figure 2 represents an equivalent formulation of boosting, with one minor
modification: the column (weak learner) selection has an absolute value. But note that this is
the same as closing $\mathcal{H}$ under complementation (i.e., for any $h \in \mathcal{H}$, there exists $h^{(-)} \in \mathcal{H}$ with
$h(x) = -h^{(-)}(x)$), which is assumed in many theoretical treatments of boosting.

In the case of the exponential loss and binary weak learners, the line search (when attainable)
has a convenient closed form; but for other losses, and even with the exponential loss but with
confidence-rated predictors, there may not be a closed form. As such, Boost only requires an
approximate line search method. Appendix D details two mechanisms for this: an iterative method,
which requires no knowledge of the loss function, and a closed form choice, which unfortunately
requires some properties of the loss, which may be difficult to bound tightly. The iterative method
provides a slightly worse guarantee, but is potentially more effective in practice; thus it will be used
to produce all convergence rates in Section 6.

For simplicity, it is supposed that the best weak learner jt (or the approximation thereof encoded 
in A) can always be selected. Relaxing this condition is not without subtleties, but as discussed 
in Appendix E, there are ways to allow approximate selection without degrading the presented
convergence rates. 

As a final remark, consider the rows $\{-a_i\}_1^m$ of $-A$ as a collection of $m$ points in $\mathbb{R}^n$. Due to the
form of $g$, Boost is therefore searching for a halfspace, parameterized by a vector $\lambda$, which contains
all of these points. Sometimes such a halfspace may not exist, and $g$ applies a smoothly increasing
penalty to points that are farther and farther outside it.



3 Dual Problem 



Applying coordinate descent to (2.3) represents a valid interpretation of boosting, in the sense that
the resulting algorithm Boost is equivalent to the original. However, this representation loses the
intuitive operation of boosting as generating distributions where the current predictor is highly
erroneous, and requesting weak learners accurate on these tricky distributions. The dual problem
will capture this.

In addition to illuminating the structure of boosting, the dual problem also possesses a major 
concrete contribution to the optimization behavior, and specifically the convergence rates: the dual 
optimum is always attainable. 



The dual problem will make use of Fenchel conjugates (Hiriart-Urruty and Lemaréchal 2001;
Borwein and Lewis 2000); for any function $h$, the conjugate is

$$h^*(\phi) = \sup_{x \in \mathrm{dom}(h)} \langle x, \phi \rangle - h(x).$$

Example 3.1. The exponential loss $\exp(\cdot)$ has Fenchel conjugate

$$(\exp(\cdot))^*(\phi) = \begin{cases} \phi \ln(\phi) - \phi & \text{when } \phi > 0, \\ 0 & \text{when } \phi = 0, \\ \infty & \text{otherwise.} \end{cases}$$

The logistic loss $\ln(1 + \exp(\cdot))$ has Fenchel conjugate

$$(\ln(1 + \exp(\cdot)))^*(\phi) = \begin{cases} (1 - \phi)\ln(1 - \phi) + \phi \ln(\phi) & \text{when } \phi \in (0, 1), \\ 0 & \text{when } \phi \in \{0, 1\}, \\ \infty & \text{otherwise.} \end{cases}$$

These conjugates are known respectively as the Boltzmann-Shannon and Fermi-Dirac entropies (see
Borwein and Lewis 2000, closing commentary, Section 3.3). Please see Figure 3 for a depiction.
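
These closed forms can be sanity-checked numerically. The sketch below approximates the supremum in the definition of the conjugate by ternary search, which is valid here because $x \mapsto x\phi - h(x)$ is concave; the search interval, the step count, and all function names are assumptions made for illustration.

```python
import math


def conjugate(h, phi, lo=-40.0, hi=40.0, steps=200):
    """Approximate h*(phi) = sup_x (x * phi - h(x)) by ternary search.

    Assumes the supremum is attained inside [lo, hi].
    """
    for _ in range(steps):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if a * phi - h(a) < b * phi - h(b):
            lo = a
        else:
            hi = b
    x = 0.5 * (lo + hi)
    return x * phi - h(x)


def logistic(x):
    """ln(1 + exp(x)) in a numerically stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def boltzmann_shannon(phi):
    """Closed form of (exp)*(phi) for phi > 0."""
    return phi * math.log(phi) - phi


def fermi_dirac(phi):
    """Closed form of (ln(1 + exp))*(phi) for phi in (0, 1)."""
    return (1.0 - phi) * math.log(1.0 - phi) + phi * math.log(phi)


errors = [
    max(abs(conjugate(math.exp, p) - boltzmann_shannon(p)),
        abs(conjugate(logistic, p) - fermi_dirac(p)))
    for p in (0.25, 0.5, 0.9)
]
```

The numerical and closed-form values agree to within rounding for interior points of the domain; the boundary cases and the value $\infty$ outside the domain are not exercised by this check.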

It further turns out that general members of $\mathbb{G}_0$ have a shape reminiscent of these two standard
notions of entropy.

Lemma 3.2. Let $g \in \mathbb{G}_0$ be given. Then $g^*$ is continuously differentiable on $\mathrm{int}(\mathrm{dom}(g^*))$, strictly
convex, and either $\mathrm{dom}(g^*) = [0, \infty)$ or $\mathrm{dom}(g^*) = [0, b]$ where $b > 0$. Furthermore, $g^*$ has the
following form:

$$g^*(\phi) \in \begin{cases} \{\infty\} & \text{when } \phi < 0, \\ \{0\} & \text{when } \phi = 0, \\ (-g(0), 0) & \text{when } \phi \in (0, g'(0)), \\ \{-g(0)\} & \text{when } \phi = g'(0), \\ (-g(0), \infty] & \text{when } \phi > g'(0). \end{cases}$$

Figure 3: Fenchel conjugates of exponential and logistic losses.



(The proof is in Appendix C.) There is one more object to present, the dual feasible set $\Phi_A$.

Definition 3.3. For any $A \in \mathbb{R}^{m \times n}$, define the dual feasible set

$$\Phi_A := \mathrm{Ker}(A^\top) \cap \mathbb{R}^m_+.$$

Consider any $\psi \in \Phi_A$. Since $\psi \in \mathrm{Ker}(A^\top)$, this is a weighting of examples which decorrelates all
weak learners from the target: in particular, for any primal weighting $\lambda \in \mathbb{R}^n$ over weak learners,
$\psi^\top A \lambda = 0$. And since $\psi \in \mathbb{R}^m_+$, all coordinates are nonnegative, so in the case that $\psi \neq 0_m$, this
vector may be renormalized into a distribution over examples. The case $\Phi_A = \{0_m\}$ is an extremely
special degeneracy: it will be shown to encode the scenario of weak learnability.

Theorem 3.4. For any $A \in \mathbb{R}^{m \times n}$ and $g \in \mathbb{G}_0$ with $f(x) = \sum_i g((x)_i)$,

$$\inf\{f(A\lambda) : \lambda \in \mathbb{R}^n\} = \sup\{-f^*(\psi) : \psi \in \Phi_A\}, \tag{3.5}$$

where $f^*(\phi) = \sum_{i=1}^m g^*((\phi)_i)$. The right hand side is the dual problem, and moreover the dual
optimum, denoted $\psi_A^f$, is unique and attainable.



(The proof uses routine techniques from convex analysis, and is deferred to Appendix G.2.)

The definition of $\Phi_A$ does not depend on any specific $g \in \mathbb{G}_0$; this choice was made to provide
general intuition on the structure of the problem for the entire family of losses. Note however that
this will cause some problems later. For instance, with the logistic loss, the vector with every value
two, i.e. $2 \cdot 1_m$, has objective value $-f^*(2 \cdot 1_m) = -\infty$. In a sense, there are points in $\Phi_A$ which are
not really candidates for certain losses, and this fact will need adjustment in some convergence rate
proofs.

Remark 3.6. Finishing the connection to maximum entropy, for any $g \in \mathbb{G}_0$, by Lemma 3.2 the
optimum of the unconstrained problem is $g'(0) 1_m$, a rescaling of the uniform distribution. But note
that $\nabla f(A\lambda_0) = \nabla f(0_m) = g'(0) 1_m$: that is, the initial dual iterate is the unconstrained optimum!

Let $\phi_t := \nabla f(A\lambda_t)$ denote the $t$th dual iterate; since $\nabla f^*(\nabla f(x)) = x$ (cf. Appendix B.2), then for
any $\psi \in \mathrm{Ker}(A^\top)$,

$$\langle \nabla f^*(\phi_t), \psi \rangle = \langle A\lambda_t, \psi \rangle = \langle \lambda_t, A^\top \psi \rangle = 0.$$

This allows the dual optimum to be rewritten as

$$\psi_A^f = \operatorname{argmin}_{\psi \in \Phi_A} f^*(\psi) = \operatorname{argmin}_{\psi \in \Phi_A} f^*(\psi) - f^*(\phi_t) - \langle \nabla f^*(\phi_t), \psi - \phi_t \rangle;$$

that is, the dual optimum $\psi_A^f$ is the Bregman projection (according to $f^*$) onto $\Phi_A$ of any dual
iterate $\phi_t = \nabla f(A\lambda_t)$. In particular, $\psi_A^f$ is the Bregman projection onto the feasible set of the
unconstrained optimum $\phi_0 = \nabla f(A\lambda_0)$!

The connection to Bregman divergences runs deep; in fact, mirroring the development of Boost
as "compiling out" the dual variables in the classical boosting presentation, it is possible to compile
out the primal variables, producing an algorithm using only dual variables, meaning distributions
over examples. This connection has been explored extensively (Kivinen and Warmuth 1999; Collins
et al. 2002).

Remark 3.7. It may be tempting to use Theorem 3.4 to produce a stopping condition; that is,
if for a supplied $\epsilon > 0$, a primal iterate $\lambda'$ and dual feasible $\psi' \in \Phi_A$ can be found satisfying
$f(A\lambda') + f^*(\psi') \le \epsilon$, Boost may terminate with the guarantee $f(A\lambda') - \bar{f}_A \le \epsilon$.

Unfortunately, it is unclear how to produce dual iterates (excepting the trivial $0_m$). If $\mathrm{Ker}(A^\top)$
can be computed, it suffices to project $\nabla f(A\lambda_t)$ onto this subspace. In general however, not only
is $\mathrm{Ker}(A^\top)$ painfully expensive to compute, this computation does not at all fit the oracle model of
boosting, where access to $A$ is obscured. (What is $\mathrm{Ker}(A^\top)$ when the weak learning oracle learns a
size-bounded decision tree?)

In fact, noting that the primal-dual relationship from (3.5) can be written

$$\inf\{f(\lambda) : \lambda \in \mathrm{Im}(A)\} = \sup\{-f^*(\psi) : \psi \in \mathrm{Ker}(A^\top) = \mathrm{Im}(A)^\perp\}$$

(since $\mathrm{dom}(f^*) \subseteq \mathbb{R}^m_+$ encodes the orthant constraint), the standard oracle model gives elements of
$\mathrm{Im}(A)$, but what is needed in the dual is an oracle for $\mathrm{Ker}(A^\top) = \mathrm{Im}(A)^\perp$.
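
When a dual feasible point happens to be available, the certificate of Remark 3.7 is easy to evaluate. The sketch below uses the exponential loss on a hypothetical two-example instance where $\mathrm{Ker}(A^\top)$ can be read off by hand; the names and the chosen points are assumptions for illustration.

```python
import math

A = [[1.0], [-1.0]]            # hypothetical instance: m = 2, n = 1


def f(x):
    """Exponential-loss risk f(x) = sum_i exp(x_i)."""
    return sum(math.exp(xi) for xi in x)


def g_star(p):
    """Conjugate of exp: Boltzmann-Shannon entropy."""
    if p < 0:
        return math.inf
    return p * math.log(p) - p if p > 0 else 0.0


def f_star(psi):
    return sum(g_star(p) for p in psi)


def matvec(A, lam):
    return [sum(a * l for a, l in zip(row, lam)) for row in A]


lam_prime = [0.1]
psi_prime = [0.9, 0.9]         # psi_1 - psi_2 = 0 and psi >= 0: dual feasible
gap = f(matvec(A, lam_prime)) + f_star(psi_prime)
optimum = 2.0                  # f(A lam) = exp(lam) + exp(-lam) is minimized at lam = 0
suboptimality = f(matvec(A, lam_prime)) - optimum
```

Here `gap` upper bounds `suboptimality`, as the remark promises, and the dual value $-f^*((1, 1)) = 2$ matches the primal optimum. The difficulty raised in the remark is untouched by this sketch: the dual feasible point was found by inspection, not by an oracle.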



4 Generalized Weak Learning Rate 

The weak learning rate was critical to the original convergence analysis of AdaBoost, providing a 
handle on the progress of the algorithm. But to be useful, this value must be positive, which was 
precisely the condition granted by the weak learning assumption. This section will generalize the 
weak learning rate into a quantity which can be made positive for any boosting instance. 

Note briefly that this manuscript will differ slightly from the norm in that weak learning will 
be a purely sample-specific concept. That is, the concern here is convergence in empirical risk, and 
all that matters is the sample $S = \{(x_i, y_i)\}_1^m$, as encoded in $A$; it doesn't matter if there are wild
points outside this sample, because the algorithm has no access to them. 

This distinction has the following implication. The usual weak learning assumption states that 
there exists no uncorrelating distribution over the input space. This of course implies that any 
training sample S used by the algorithm will also have this property; however, it suffices that there 
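is no distribution over the input sample S which uncorrelates the weak learners from the target.

Returning to task, the weak learning assumption posits the existence of a positive constant, the
weak learning rate $\gamma$, which lower bounds the correlation of the best weak learner with the target
for any distribution. Stated in terms of the matrix $A$,

$$0 < \gamma = \inf_{\substack{\phi \in \mathbb{R}^m_+ \\ \|\phi\|_1 = 1}} \max_{j \in [n]} \left| \sum_{i=1}^m (\phi)_i y_i h_j(x_i) \right| = \inf_{\phi \in \mathbb{R}^m_+ \setminus \{0_m\}} \frac{\|A^\top \phi\|_\infty}{\|\phi\|_1} = \inf_{\phi \in \mathbb{R}^m_+ \setminus \{0_m\}} \frac{\|A^\top \phi\|_\infty}{\|\phi - 0_m\|_1}. \tag{4.1}$$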



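Proposition 4.2. A boosting instance is weak learnable iff $\Phi_A = \{0_m\}$.

Proof. Suppose $\Phi_A = \{0_m\}$; since the first infimum in (4.1) is of a continuous function over a
compact set, it has some minimizer $\phi'$. But $\|\phi'\|_1 = 1$, meaning $\phi' \notin \Phi_A$, and so $\|A^\top \phi'\|_\infty > 0$. On
the other hand, if $\Phi_A \neq \{0_m\}$, take any $\phi'' \in \Phi_A \setminus \{0_m\}$; then

$$0 \le \gamma = \inf_{\phi \in \mathbb{R}^m_+ \setminus \{0_m\}} \frac{\|A^\top \phi\|_\infty}{\|\phi\|_1} \le \frac{\|A^\top \phi''\|_\infty}{\|\phi''\|_1} = 0.$$

□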
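
For instances with only two examples, the first infimum in (4.1) can be approximated by a grid over the simplex. The sketch below does this for two hypothetical matrices; the function name and grid resolution are assumptions, and a grid only upper bounds the true infimum in general.

```python
def weak_learning_rate(A, grid=200):
    """Grid approximation (m = 2 only) of inf over the simplex of ||A^T phi||_inf."""
    n = len(A[0])
    best = float("inf")
    for k in range(grid + 1):
        phi = [k / grid, 1.0 - k / grid]
        corr = [abs(A[0][j] * phi[0] + A[1][j] * phi[1]) for j in range(n)]
        best = min(best, max(corr))
    return best


A_learnable = [[-1.0, 0.0], [0.0, -1.0]]   # each column is correct on one example
A_hard = [[-1.0], [1.0]]                   # the single learner is right once, wrong once

gamma_learnable = weak_learning_rate(A_learnable)
gamma_hard = weak_learning_rate(A_hard)
```

The first instance yields a positive rate, attained at the uniform distribution. The second yields zero, also at the uniform distribution, which is then a nonzero element of the dual feasible set, in line with Proposition 4.2.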



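Following this connection, the first way in which the weak learning rate is modified is to replace
$\{0_m\}$ with the dual feasible set $\Phi_A = \mathrm{Ker}(A^\top) \cap \mathbb{R}^m_+$. For reasons that will be sketched shortly, but
fully dealt with only in Section 6, it is necessary to replace $\mathbb{R}^m_+$ with a more refined choice $S$.

Definition 4.3. Given a matrix $A \in \mathbb{R}^{m \times n}$ and a set $S \subseteq \mathbb{R}^m$, define

$$\gamma(A, S) := \inf\left\{ \frac{\|A^\top \phi\|_\infty}{\inf_{\psi \in S \cap \mathrm{Ker}(A^\top)} \|\phi - \psi\|_1} : \phi \in S \setminus \mathrm{Ker}(A^\top) \right\}.$$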







First note that in the scenario of weak learnability (i.e., $\Phi_A = \{0_m\}$ by Proposition 4.2), the
choice $S = \mathbb{R}^m_+$ allows the new notion to exactly cover the old one: $\gamma(A, \mathbb{R}^m_+) = \gamma$.

To get a better handle on the meaning of $S$, first define the following projection and distance
notation to a closed convex nonempty set $C$, where in the case of non-uniqueness ($\ell^1$ and $\ell^\infty$), some
arbitrary choice is made:

$$P^p_C(x) \in \operatorname{Argmin}_{y \in C} \|y - x\|_p, \qquad D^p_C(x) = \|x - P^p_C(x)\|_p.$$



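Suppose, for some $t$, that $\nabla f(A\lambda_t) \in S \setminus \mathrm{Ker}(A^\top)$; then the infimum within $\gamma(A, S)$ may be
instantiated with $\nabla f(A\lambda_t)$, yielding

$$\gamma(A, S) = \inf_{\phi \in S \setminus \mathrm{Ker}(A^\top)} \frac{\|A^\top \phi\|_\infty}{\|\phi - P^1_{S \cap \mathrm{Ker}(A^\top)}(\phi)\|_1} \le \frac{\|A^\top \nabla f(A\lambda_t)\|_\infty}{\|\nabla f(A\lambda_t) - P^1_{S \cap \mathrm{Ker}(A^\top)}(\nabla f(A\lambda_t))\|_1}. \tag{4.4}$$

Rearranging this,

$$\gamma(A, S) \, \left\|\nabla f(A\lambda_t) - P^1_{S \cap \mathrm{Ker}(A^\top)}(\nabla f(A\lambda_t))\right\|_1 \le \|A^\top \nabla f(A\lambda_t)\|_\infty. \tag{4.5}$$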
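
In the weak learnable case with $S = \mathbb{R}^m_+$, the set $S \cap \mathrm{Ker}(A^\top)$ is just $\{0_m\}$, so the projection in (4.5) is the origin and the inequality can be checked directly. The sketch below does so for the exponential loss on a hypothetical instance; the value `gamma = 0.5` is specific to this matrix (it is the infimum of $\max(\phi_1, \phi_2)$ over the simplex), and the test points are arbitrary.

```python
import math

A = [[-1.0, 0.0], [0.0, -1.0]]   # hypothetical weak learnable instance
gamma = 0.5                      # gamma(A, R^m_+) for this particular A


def both_sides(lam):
    """Return (left, right) sides of (4.5) with S = R^m_+ and projection 0_m."""
    margins = [sum(a * l for a, l in zip(row, lam)) for row in A]
    dual = [math.exp(z) for z in margins]                       # nabla f(A lam)
    corr = [sum(A[i][j] * dual[i] for i in range(2)) for j in range(2)]
    left = gamma * sum(abs(d) for d in dual)                    # gamma * ||dual - 0||_1
    right = max(abs(c) for c in corr)                           # ||A^T dual||_inf
    return left, right


checks = [both_sides(lam) for lam in ([0.0, 0.0], [1.0, 3.0], [-2.0, 0.5])]
```

At $\lambda = 0_n$ the two sides coincide, since the dual iterate is uniform and the infimum defining $\gamma$ is attained there; at the other points the inequality is strict.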



This is helpful because the right hand side appears in standard guarantees for single-step progress in
descent methods. Meanwhile, the left hand side has reduced the influence of $A$ to a single number,
and the normed expression is the distance to a restriction of the dual feasible set, which will converge
to zero if the infimum is to be approached, so long as this restriction contains the dual optimum.

This will be exactly the approach taken in this manuscript; indeed, the first step towards
convergence rates, Proposition 6.2, will use exactly the upper bound in (4.5). The detailed work that
remains is then dealing with the distance to the dual feasible set. The choice of $S$ will be made to
facilitate the production of these bounds, and will depend on the optimization structure revealed in
Section 5.

In order for these expressions to mean anything, $\gamma(A, S)$ must be positive.

Theorem 4.6. Let matrix $A \in \mathbb{R}^{m \times n}$ and polyhedron $S \subseteq \mathbb{R}^m$ be given with $S \setminus \mathrm{Ker}(A^\top) \neq \emptyset$ and
$S \cap \mathrm{Ker}(A^\top) \neq \emptyset$. Then $\gamma(A, S) > 0$.

The proof, material on other generalizations of $\gamma$, and discussion on the polyhedrality of $S$ can
all be found in Appendix F.

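As a final connection, since $A^\top P^1_{S \cap \mathrm{Ker}(A^\top)}(\phi) = 0_n$, note that

$$\gamma(A, S) = \inf_{\phi \in S \setminus \mathrm{Ker}(A^\top)} \frac{\|A^\top \phi\|_\infty}{\|\phi - P^1_{S \cap \mathrm{Ker}(A^\top)}(\phi)\|_1} = \inf_{\phi \in S \setminus \mathrm{Ker}(A^\top)} \frac{\|A^\top (\phi - P^1_{S \cap \mathrm{Ker}(A^\top)}(\phi))\|_\infty}{\|\phi - P^1_{S \cap \mathrm{Ker}(A^\top)}(\phi)\|_1}.$$

In this way, $\gamma(A, S)$ resembles a Lipschitz constant, reflecting the effect of $A$ on elements of the dual,
relative to the dual feasible set.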



5 Optimization Structure 



The scenario of weak learnability translates into a simple condition on the dual feasible set: the dual
feasible set is the origin (in symbols, $\Phi_A = \mathrm{Ker}(A^\top) \cap \mathbb{R}^m_+ = \{0_m\}$). And how about attainability: is
there a simple way to encode this problem in terms of the optimization problem?

This section will identify the structure of the boosting optimization problem both in terms of the 
primal and dual problems, first studying the scenarios of weak learnability and attainability, and 
then showing that general instances can be decomposed into these two. 

There is another behavior which will emerge through this study, motivated by the following
question. The dual feasible set $\Phi_A = \mathrm{Ker}(A^\top) \cap \mathbb{R}^m_+$ is the set of nonnegative weightings of examples
under which every weak learner (every column of $A$) has zero correlation; what is the support of
these weightings?

Definition 5.1. $H(A)$ denotes the hard core of $A$: the collection of examples which receive positive
weight under some dual feasible point, a distribution upon which no weak learner is correlated with
the target. Symbolically,

$$H(A) := \{i \in [m] : \exists \psi \in \Phi_A, (\psi)_i > 0\}.$$



One case has already been considered; as established in Proposition 4.2, weak learnability is
equivalent to $\Phi_A = \{0_m\}$, which in turn is equivalent to $|H(A)| = 0$. But it will turn out that other
possibilities for $H(A)$ also have direct relevance to the behavior of Boost. Indeed, contrasted with
the primal and dual problems and feasible sets, $H(A)$ will provide a conceptually simple, discrete
object with which to comprehend the behavior of boosting.
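
To make the hard core concrete, the sketch below checks dual feasibility of a candidate weighting and reads off its support. The matrix, the candidate $\psi$, and the helper names are hypothetical. Indices are 0-based in the code, whereas the definition above is 1-based.

```python
def is_dual_feasible(A, psi, tol=1e-12):
    """Check that psi is coordinate-wise nonnegative and satisfies A^T psi = 0."""
    m, n = len(A), len(A[0])
    nonneg = all(p >= 0 for p in psi)
    in_kernel = all(
        abs(sum(A[i][j] * psi[i] for i in range(m))) <= tol for j in range(n)
    )
    return nonneg and in_kernel


def support(psi):
    """Indices receiving positive weight."""
    return {i for i, p in enumerate(psi) if p > 0}


# Hypothetical instance: A^T psi = (psi_0 - psi_1, -psi_2).
A = [[1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]]
psi = [0.5, 0.5, 0.0]

feasible = is_dual_feasible(A, psi)
core_lower_bound = support(psi)
```

For this matrix, the second column forces the last coordinate of every dual feasible point to vanish, so the support computed here is the whole hard core: a nonempty proper subset of the training set. In general a single feasible point only gives a subset of $H(A)$.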



5.1 Weak Learnability 

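The following theorem establishes four equivalent formulations of weak learnability.

Theorem 5.2. For any $A \in \mathbb{R}^{m \times n}$ and $g \in \mathbb{G}_0$ the following conditions are equivalent:

(5.2.1) $\exists \lambda \in \mathbb{R}^n$, $A\lambda \in \mathbb{R}^m_{--}$ (that is, $A\lambda < 0_m$),

(5.2.2) $\inf_{\lambda \in \mathbb{R}^n} f(A\lambda) = 0$,

(5.2.3) $\psi_A^f = 0_m$,

(5.2.4) $\Phi_A = \{0_m\}$, equivalently $|H(A)| = 0$.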



First note that (5.2.4) indicates (via Proposition 4.2) that this is indeed the weak learnability setting.

Figure 4: Geometric view of the primal and dual problem, under weak learnability. The vertices of
the pentagon denote the points $\{-a_i\}_1^m$. The arrow, denoting $\lambda$ in (5.2.1), defines a homogeneous
halfspace containing these points; on the other hand, their convex hull does not contain the origin.
Please see Theorem 5.2 and its discussion.



Recall the earlier discussion of boosting as searching for a halfspace containing the points
$\{-a_i\}_1^m = \{-e_i^\top A\}_1^m$; property (5.2.1) encodes precisely this statement, and moreover that there
exists such a halfspace with these points interior to it. Note that this statement also encodes the
margin separability equivalence of weak learnability due to Shalev-Shwartz and Singer (2008);
specifically, if labels are bounded away from 0 and each point $-a_i$ (row of $-A$) is replaced with $-y_i a_i$, the
definition of $A$ grants that positive examples will land on one side of the hyperplane, and negative
examples on the other.

The two properties (5.2.4) and (5.2.1) can be interpreted geometrically, as depicted in Figure 4:
the dual feasibility statement is that no convex combination of $\{-a_i\}_1^m$ will equal the origin.



Next, (5.2.2) is the (error part of the) usual strong PAC guarantee (Schapire 1990): weak
learnability entails that the training error will go to zero. And, as must be the case when $\Phi_A = \{0_m\}$,
property (5.2.3) provides that $\psi_A^f = 0_m$.



Proof of Theorem 5.2. ((5.2.1) $\Longrightarrow$ (5.2.2)) Let $\bar\lambda \in \mathbb{R}^n$ be given with $A\bar\lambda \in \mathbb{R}^m_{--}$, and let any
increasing sequence $\{c_i\}_1^\infty \uparrow \infty$ be given. Then, since $f \ge 0$ and $\lim_{x \to -\infty} g(x) = 0$,

$$\inf_\lambda f(A\lambda) \le \lim_{i \to \infty} f(c_i A\bar\lambda) = 0 \le \inf_\lambda f(A\lambda).$$

((5.2.2) $\Longrightarrow$ (5.2.3)) The point $0_m$ is always dual feasible, and

$$\inf_\lambda f(A\lambda) = 0 = -f^*(0_m).$$

Since the dual optimum is unique (Theorem 3.4), $\psi_A^f = 0_m$.

((5.2.3) $\Longrightarrow$ (5.2.4)) Suppose there exists $\psi \in \Phi_A$ with $\psi \ne 0_m$. Since $-f^*$ is continuous and
increasing along every positive direction at $0_m = \psi_A^f$ (see Lemma 3.2 and Lemma C.2), there must
exist some tiny $\tau > 0$ such that $-f^*(\tau\psi) > -f^*(\psi_A^f)$, contradicting the selection of $\psi_A^f$ as the unique
optimum.

((5.2.4) $\Longrightarrow$ (5.2.1)) This case is directly handled by Gordan's theorem (cf. Theorem B.1). □
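
The step (5.2.1) $\Longrightarrow$ (5.2.2) can be checked numerically. The following is a minimal sketch, assuming the exponential loss $g = \exp$ and a small hypothetical matrix `A` with a hypothetical direction `lam_bar` (neither appears in the text): scaling a direction whose image is strictly negative drives $f$ toward its infimum of zero.

```python
import math

# Hypothetical 3x2 instance chosen for illustration; not taken from the text.
A = [(-1.0, 0.5), (-0.5, -1.0), (-1.0, -1.0)]
lam_bar = (1.0, 1.0)  # every coordinate of A @ lam_bar is negative, as in (5.2.1)


def matvec(matrix, vec):
    """Plain matrix-vector product for small dense inputs."""
    return [sum(a * v for a, v in zip(row, vec)) for row in matrix]


def f(x):
    """f(x) = sum_i g(x_i) with g = exp (exponential loss, an assumption here)."""
    return sum(math.exp(xi) for xi in x)


def risk_along_ray(scales):
    """Evaluate f(c * A @ lam_bar) for each scale c."""
    base = matvec(A, lam_bar)
    assert all(b < 0 for b in base), "lam_bar must satisfy (5.2.1)"
    return [f([c * b for b in base]) for c in scales]


risks = risk_along_ray([1, 2, 4, 8, 16, 32])
```

The values in `risks` shrink toward zero as the scale grows, matching the limit used in the proof.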



5.2 Attainability 

For strictly convex functions, there is a nice characterization of attainability, which will require the
following definition.

Definition 5.3 (cf. Hiriart-Urruty and Lemaréchal (2001, Definition B.3.2.5)). A closed convex
function $h$ is called 0-coercive when all level sets are compact. (That is, for any $\alpha \in \mathbb{R}$, the set
$\{x : h(x) \le \alpha\}$ is compact.)

Proposition 5.4. Suppose $h$ is differentiable, strictly convex, and $\mathrm{dom}(h) = \mathbb{R}^m$. Then $\inf_x h(x)$
is attainable iff $h$ is 0-coercive.



Note that 0-coercivity means the domain of the infimum in (2.3) can be restricted to a compact
set, and attainability in turn follows just from properties of minimization of continuous functions on
compact sets. It is the converse which requires some structure; the proof however is unilluminating
and deferred to Appendix G.3.

Armed with this notion, it is now possible to build an attainability theory for $f \circ A$. Some care
must be taken with the above concepts, however; note that while $f$ is strictly convex, $f \circ A$ need
not be (for instance, if there exist nonzero elements of $\mathrm{Ker}(A)$, then moving along these directions
does not change the objective value). Therefore, 0-coercivity statements will refer to the function

$$(f + \iota_{\mathrm{Im}(A)})(x) = \begin{cases} f(x) & \text{when } x \in \mathrm{Im}(A), \\ \infty & \text{otherwise.} \end{cases}$$

This function is effectively taking the epigraph of $f$, and intersecting it with a slice representing
$\mathrm{Im}(A) = \{A\lambda : \lambda \in \mathbb{R}^n\}$, the set of points considered by the algorithm. As such, it is merely a
convenient way of dealing with $\mathrm{Ker}(A)$ as discussed above.

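Theorem 5.5. For any $A \in \mathbb{R}^{m \times n}$ and $g \in \mathbb{G}_0$, the following conditions are equivalent:

(5.5.1) $\forall \lambda \in \mathbb{R}^n \,.\, A\lambda \notin \mathbb{R}^m_- \setminus \{0_m\}$,

(5.5.2) $f + \iota_{\mathrm{Im}(A)}$ is 0-coercive,

(5.5.3) $\psi_A^f \in \mathbb{R}^m_{++}$,

(5.5.4) $\Phi_A \cap \mathbb{R}^m_{++} \ne \emptyset$.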



Following the discussion above, (5.5.2) is the desired attainability statement. 

Next, note that (5.5.4) is equivalent to the expression $|H(A)| = m$, i.e. there exists a distribution
with positive weight on all examples, upon which every weak learner is uncorrelated. The forward
direction is direct from the existence of a single $\psi \in \Phi_A \cap \mathbb{R}^m_{++}$. For the converse, note that the $\psi_i$
corresponding to each $i \in H(A)$ can be combined into $\psi = \sum_i \psi_i \in \mathrm{Ker}(A^\top) \cap \mathbb{R}^m_{++}$ (since $\mathrm{Ker}(A^\top)$
is a subspace).

For a geometric interpretation, consider (5.5.1) and (5.5.4). The first says that any halfspace
containing some $-a_i$ within its interior must also fail to contain some $-a_j$ (with $i \ne j$). (Property
(5.5.1) also allows for the scenario that no valid enclosing halfspace exists, i.e. $\lambda = 0_n$.) The latter
states that the origin $0_n$ is contained within a positive convex combination of $\{-a_i\}_1^m$ (alternatively,
the origin is within the relative interior of these points). These two scenarios appear in Figure 5.

Finally, note (5.5.3): it is not only the case that there are dual feasible points fully interior to
$\mathbb{R}^m_+$, but furthermore the dual optimum is also interior. This will be crucial in the convergence rate
analysis, since it will allow the dual iterates to never be too small.



Proof of Theorem 5.5. ((5.5.1) $\Longrightarrow$ (5.5.2)) Let $d \in \mathbb{R}^m \setminus \{0_m\}$ and $\lambda \in \mathbb{R}^n$ be arbitrary. To show
0-coercivity, it suffices (Hiriart-Urruty and Lemaréchal 2001, Proposition B.3.2.4.iii) to show

$$\lim_{t \to \infty} \frac{f(A\lambda + td) + \iota_{\mathrm{Im}(A)}(A\lambda + td) - f(A\lambda)}{t} > 0. \qquad (5.6)$$

If $d \notin \mathrm{Im}(A)$ (and $t > 0$), then $\iota_{\mathrm{Im}(A)}(A\lambda + td) = \infty$. Suppose $d \in \mathrm{Im}(A)$; by (5.5.1), since $d \ne 0_m$,
then $d \notin \mathbb{R}^m_-$, meaning there is at least one positive coordinate $j$. But then, since $g \ge 0$ and $g$ is
convex,

$$(5.6) \ge \lim_{t \to \infty} \frac{g(e_j^\top(A\lambda + td)) - f(A\lambda)}{t} \ge \lim_{t \to \infty} \frac{g(e_j^\top A\lambda) + t d_j g'(e_j^\top A\lambda) - f(A\lambda)}{t} = d_j g'(e_j^\top A\lambda),$$

which is positive by the selection of $d_j$ and since $g' > 0$.

((5.5.2) $\Longrightarrow$ (5.5.3)) Since the infimum is attainable, designate any $\bar\lambda$ satisfying $\inf_\lambda f(A\lambda) = f(A\bar\lambda)$
(note, although $f$ is strictly convex, $f \circ A$ need not be, thus uniqueness is not guaranteed!).
The optimality conditions of Fenchel problems may be applied, meaning $\psi_A^f = \nabla f(A\bar\lambda)$, which is
interior to $\mathbb{R}^m_+$ since $\nabla f \in \mathbb{R}^m_{++}$ everywhere (cf. Lemma C.2). (For the optimality conditions, see
Borwein and Lewis (2000, Exercise 3.3.9.f), with a negation inserted to match the negation inserted
within the proof of Theorem 3.4.)

((5.5.3) $\Longrightarrow$ (5.5.4)) This holds since $\Phi_A \ni \psi_A^f$ and $\psi_A^f \in \mathbb{R}^m_{++}$.

((5.5.4) $\Longrightarrow$ (5.5.1)) This case is directly handled by Stiemke's Theorem (cf. Theorem B.4). □

Figure 5: Geometric view of the primal and dual problem, under attainability. Once again, the
$\{-a_i\}_1^m$ are the vertices of the pentagon. This time, no (closed) homogeneous halfspace containing
all the points will contain one strictly, and the relative interior of the pentagon contains the origin.
Please see Theorem 5.5 and its discussion.



5.3 General Setting 

So far, the scenarios of weak learnability and attainability corresponded to the extremal hard core
cases of $|H(A)| \in \{0, m\}$. The situation in the general setting $1 \le |H(A)| \le m - 1$ is basically as
good as one could hope for: it interpolates between the two extremal cases.
As a first step, partition $A$ into two submatrices according to $H(A)$.

Definition 5.7. Partition $A \in \mathbb{R}^{m \times n}$ by rows into two matrices $A_0 \in \mathbb{R}^{m_0 \times n}$ and $A_+ \in \mathbb{R}^{m_+ \times n}$,
where $A_+$ has rows corresponding to $H(A)$, and $m_+ = |H(A)|$. For convenience, permute the
examples so that

$$A = \begin{bmatrix} A_0 \\ A_+ \end{bmatrix}.$$

(This merely relabels the coordinate axes, and does not change the optimization problem.) Note
that this decomposition is unique, since $H(A)$ is uniquely specified.

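As a first consequence, this partition cleanly decomposes the dual feasible set $\Phi_A$ into $\Phi_{A_0}$ and
$\Phi_{A_+}$.

Proposition 5.8. For any $A \in \mathbb{R}^{m \times n}$, $\Phi_{A_0} = \{0_{m_0}\}$, $\Phi_{A_+} \cap \mathbb{R}^{m_+}_{++} \ne \emptyset$, and $\Phi_A = \Phi_{A_0} \times \Phi_{A_+}$.
Furthermore, no other partition of $A$ into $B_0 \in \mathbb{R}^{z \times n}$ and $B_+ \in \mathbb{R}^{p \times n}$ satisfies these properties.

Proof. It must hold that $\Phi_{A_0} = \{0_{m_0}\}$, since otherwise there would exist $\psi \in \mathrm{Ker}(A_0^\top) \cap \mathbb{R}^{m_0}_+$ with
$\psi \ne 0_{m_0}$, which could be extended to $\psi' = \psi \times 0_{m_+} \in \Phi_A$, and the positive coordinate of $\psi$ could
be added to $H(A)$, contradicting the construction of $H(A)$ as including all such rows.

The property $\Phi_{A_+} \cap \mathbb{R}^{m_+}_{++} \ne \emptyset$ was proved in the discussion of Theorem 5.5: simply add together,
for each $i \in H(A)$, the $\psi_i$'s corresponding to positive weight on $i$.

For the decomposition, note first that certainly every $\psi \in \Phi_{A_0} \times \Phi_{A_+}$ satisfies $\psi \in \Phi_A$. Now
suppose contradictorily that there exists $\psi' \in \Phi_A \setminus (\Phi_{A_0} \times \Phi_{A_+})$. There must exist $j \in [m] \setminus H(A)$
with $(\psi')_j > 0$, since otherwise $\psi' \in \{0_{m_0}\} \times \Phi_{A_+}$; but that means $j$ should have been included in
$H(A)$, a contradiction.

For the uniqueness property, suppose some other $B_0, B_+$ is given, satisfying the desired properties.
It is impossible that some $a_i \in B_+$ is not in $H(A)$, since any $\psi \in \Phi_{B_+} \cap \mathbb{R}^p_{++}$ can be extended to $\psi' \in \Phi_A$
with positive weight on $i$, and thus $i$ is included in $H(A)$ by definition. But the other case with
$i \in H(A)$ but $a_i \in B_0$ is equally untenable, since the corresponding measure $\psi_i$ is in $\Phi_A$ but not in
$\Phi_{B_0} \times \Phi_{B_+}$. □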


The main result of this section will have the same two main ingredients as Proposition 5.8:

• The full boosting instance may be uniquely decomposed into two pieces, $A_0$ and $A_+$, each of
which individually behave like the weak learnability and attainability scenarios.

• The subinstances have a somewhat independent effect on the full instance.



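Theorem 5.9. Let $g \in \mathbb{G}_0$ and $A \in \mathbb{R}^{m \times n}$ be given. Let $B_0 \in \mathbb{R}^{z \times n}$, $B_+ \in \mathbb{R}^{p \times n}$ be any partition
of $A$ by rows. The following conditions are equivalent:

(5.9.1) $\exists \lambda \in \mathbb{R}^n \,.\, B_0\lambda \in \mathbb{R}^z_{--} \wedge B_+\lambda = 0_p$, and $\forall \lambda \in \mathbb{R}^n \,.\, B_+\lambda \notin \mathbb{R}^p_- \setminus \{0_p\}$,

(5.9.2) $\inf_{\lambda \in \mathbb{R}^n} f(A\lambda) = \inf_{\lambda \in \mathbb{R}^n} f(B_+\lambda)$, and $\inf_{\lambda \in \mathbb{R}^n} f(B_0\lambda) = 0$, and $f + \iota_{\mathrm{Im}(B_+)}$ is 0-coercive,

(5.9.3) $\psi_A^f = \begin{bmatrix} \psi_{B_0}^f \\ \psi_{B_+}^f \end{bmatrix}$ with $\psi_{B_0}^f = 0_z$ and $\psi_{B_+}^f \in \mathbb{R}^p_{++}$,

(5.9.4) $\Phi_{B_0} = \{0_z\}$, and $\Phi_{B_+} \cap \mathbb{R}^p_{++} \ne \emptyset$, and $\Phi_A = \Phi_{B_0} \times \Phi_{B_+}$.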



Stepping through these properties, notice that (5.9.4) mirrors the expression in Proposition 5.8.
But that proposition also granted that this representation was unique, thus only one partition of $A$
satisfies the above properties, namely $A_0, A_+$. Since this Theorem is stated as a series of equivalences,
any one of these properties can in turn be used to identify the hard core set $H(A)$.

To continue with geometric interpretations, notice that (5.9.1) states that there exists a halfspace
strictly containing those points in $[m] \setminus H(A)$, with all points of $H(A)$ on its boundary; furthermore,
trying to adjust this halfspace to contain elements of $H(A)$ will place others outside it. With regards
to the geometry of the dual feasible set as provided by (5.9.4), the origin is within the relative interior
of the points corresponding to $H(A)$, however the convex hull of the other $m - |H(A)|$ points can
not contain the origin. Furthermore, if the origin is written as a convex combination of all points,
this combination must place zero weight on the points with indices $[m] \setminus H(A)$. This scenario is
depicted in Figure 6.

Figure 6: Geometric view of the primal and dual problem in the general case. There is a closed
homogeneous halfspace containing the points $\{-a_i\}_1^m$, where the hard core lies on the halfspace
boundary, and the other points are within its interior; moreover, there does not exist a closed
homogeneous halfspace containing all points but with strict containment on a point in the hard
core. Finally, although the origin is in the convex hull of $\{-a_i\}_1^m$, any such convex combination
places zero weight on points outside the hard core. Please see Theorem 5.9 and its discussion.



In properties (5.9.2) and (5.9.3), $B_0$ mirrors the behavior of weakly learnable instances in Theorem 5.2,
and analogously $B_+$ follows instances with minimizers from Theorem 5.5. The interesting
addition, as discussed above, is the independence of these components: (5.9.2) provides that the
infimum of the combined problem is the sum of the infima of the subproblems, while (5.9.3) provides
that the full dual optimum may be obtained by concatenating the subproblems' dual optima.



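Proof of Theorem 5.9. ((5.9.1) $\Longrightarrow$ (5.9.2)) Let $\bar\lambda$ be given with $B_0\bar\lambda \in \mathbb{R}^z_{--}$ and $B_+\bar\lambda = 0_p$, and let
$\{c_i\}_1^\infty \uparrow \infty$ be an arbitrary sequence increasing without bound. Lastly, let $\{\lambda_i\}_1^\infty$ be a minimizing
sequence for $\inf_\lambda f(B_+\lambda)$. Then

$$\inf_\lambda f(B_+\lambda) = \lim_{i \to \infty} \big( f(B_+\lambda_i) + f(c_i B_0\bar\lambda) \big) \ge \inf_\lambda f(A\lambda) = \inf_\lambda \big( f(B_+\lambda) + f(B_0\lambda) \big) \ge \inf_\lambda f(B_+\lambda),$$

which used the fact that $f(B_0\lambda) \ge 0$ since $f \ge 0$. And since the chain of inequalities starts and
ends the same, it must be a chain of equalities, which means $\inf_\lambda f(B_0\lambda) = 0$. To show 0-coercivity
of $f + \iota_{\mathrm{Im}(B_+)}$, note the second part of (5.9.1) is one of the conditions of Theorem 5.5.

((5.9.2) $\Longrightarrow$ (5.9.3)) First, by Theorem 5.2, $\inf_\lambda f(B_0\lambda) = 0$ means $\psi_{B_0}^f = 0_z$, and $\Phi_{B_0} = \{0_z\}$. Thus

$$\begin{aligned}
-f^*(\psi_A^f) &= \sup_{\psi \in \Phi_A} -f^*(\psi) \\
&= \sup\{ -f^*(\psi_z) - f^*(\psi_p) : \psi_z \in \mathbb{R}^z_+, \psi_p \in \mathbb{R}^p_+, B_0^\top\psi_z + B_+^\top\psi_p = 0_n \} \\
&\ge \sup_{\psi_z \in \Phi_{B_0}} -f^*(\psi_z) + \sup_{\psi_p \in \Phi_{B_+}} -f^*(\psi_p) \\
&= 0 - f^*(\psi_{B_+}^f) = \inf_\lambda f(B_+\lambda) = \inf_\lambda f(A\lambda) = -f^*(\psi_A^f).
\end{aligned}$$

Combining this with $f^*(x) = \sum_i g^*((x)_i)$ and $g^*(0) = 0$ (cf. Lemma 3.2 and Theorem 3.4), $f^*(\psi_A^f) =
f^*(\psi_{B_+}^f) = f^*\big([\psi_{B_0}^f ; \psi_{B_+}^f]\big)$. But Theorem 3.4 shows $\psi_A^f$ was unique, which gives the result. And to
obtain $\psi_{B_+}^f \in \mathbb{R}^p_{++}$, use Theorem 5.5 with the 0-coercivity of $f + \iota_{\mathrm{Im}(B_+)}$.

((5.9.3) $\Longrightarrow$ (5.9.4)) Since $\psi_{B_0}^f = 0_z$, it follows by Theorem 5.2 that $\Phi_{B_0} = \{0_z\}$. Furthermore,
since $\psi_{B_+}^f \in \mathbb{R}^p_{++}$, it follows that $\Phi_{B_+} \cap \mathbb{R}^p_{++} \ne \emptyset$. Now suppose contradictorily that $\Phi_A \ne \Phi_{B_0} \times \Phi_{B_+}$;
since it always holds that $\Phi_A \supseteq \Phi_{B_0} \times \Phi_{B_+}$, this supposition grants the existence of $\psi = [\psi_z ; \psi_p] \in \Phi_A$
where $\psi_z \in \mathbb{R}^z_+ \setminus \{0_z\}$.

Consider the element $q := \psi + \psi_A^f$, which has more nonzero entries than $\psi_A^f$, but still $q \in \Phi_A$
since $\Phi_A$ is a convex cone. Let $I_q$ index the nonzero entries of $q$, and let $A_q$ be the restriction of
$A$ to the rows $I_q$. Since $q \in \Phi_A$, meaning $q$ is nonnegative and $q \in \mathrm{Ker}(A^\top)$, it follows that the
restriction of $q$ to its positive entries is within $\mathrm{Ker}(A_q^\top)$ (because only zeros of $q$ and matching rows
of $A$ are removed, dot products between $q$ and rows of $A^\top$ are the same as dot products between the
restriction of $q$ and rows of $A_q^\top$), and so this restriction is within $\Phi_{A_q}$, meaning $\Phi_{A_q} \cap \mathbb{R}^{|I_q|}_{++}$ is nonempty.
Correspondingly, by Theorem 5.5, the dual optimum $\psi_{A_q}^f$ of this restricted problem will have only positive entries.

But by the same reasoning granting that $q$ restricted to $I_q$ is within $\Phi_{A_q}$, it follows that the full
optimum $\psi_A^f$, restricted to $I_q$, must also be within $\Phi_{A_q}$ (since, by $q$'s construction, $\psi_A^f$'s zero entries
are a superset of the zero entries of $q$). This restriction of $\psi_A^f$ to $I_q$, written $\psi_A^f|_{I_q}$, will have at least one
zero entry, meaning it can not be equal to $\psi_{A_q}^f$; but Theorem 3.4 provided that the dual optimum
is unique, thus $-f^*(\psi_{A_q}^f) > -f^*(\psi_A^f|_{I_q})$. Finally, produce $\bar\psi_A$ from $\psi_{A_q}^f$ by inserting a zero for each
entry outside of $I_q$; the same reasoning that allows feasibility to be maintained while removing zeros allows
them to be added, and thus $\bar\psi_A \in \Phi_A$. But this is a contradiction: since $g^*(0) = 0$ (cf. Lemma 3.2),
both $\bar\psi_A$ and the optimum $\psi_A^f$ have zero contribution to the objective along the entries outside of
$I_q$, and thus

$$-f^*(\bar\psi_A) = -f^*(\psi_{A_q}^f) > -f^*(\psi_A^f|_{I_q}) = -f^*(\psi_A^f),$$

meaning $\bar\psi_A$ is feasible and has strictly greater objective value than the optimum $\psi_A^f$, a contradiction.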



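((5.9.4) $\Longrightarrow$ (5.9.1)) Unwrapping the definition of $\Phi_A$, the assumed statements imply

$$\big(\forall \phi_0 \in \mathbb{R}^z_+ \setminus \{0_z\}, \phi_+ \in \mathbb{R}^p_+ \,.\, B_0^\top\phi_0 + B_+^\top\phi_+ \ne 0_n\big) \wedge \big(\exists \phi_+ \in \mathbb{R}^p_{++} \,.\, B_+^\top\phi_+ = 0_n\big).$$

Applying Motzkin's transposition theorem (cf. Theorem B.7) to the left statement and Stiemke's
theorem (cf. Theorem B.4, which is implied by Motzkin's theorem) to the right yields

$$\big(\exists \lambda \in \mathbb{R}^n \,.\, B_0\lambda \in \mathbb{R}^z_{--} \wedge B_+\lambda \in \mathbb{R}^p_-\big) \wedge \big(\forall \lambda \in \mathbb{R}^n \,.\, B_+\lambda \notin \mathbb{R}^p_- \setminus \{0_p\}\big),$$

which implies the desired statement. □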



Remark 5.10. Notice the dominant role $A$ plays in the structure of the solution found by boosting.
For every $i \in [m] \setminus H(A)$, the corresponding dual weights go to zero (i.e., $(\nabla f(A\lambda_t))_i \downarrow 0$), and the
corresponding primal margins grow unboundedly (i.e., $-e_i^\top A\lambda_t \uparrow \infty$, since otherwise $\inf_\lambda f(A_0\lambda) >
0$). This is completely unaffected by the choice of $g \in \mathbb{G}_0$. Furthermore, whether this instance is
weak learnable, attainable, or neither is dictated purely by $A$ (respectively $|H(A)| = 0$, $|H(A)| = m$,
or $|H(A)| \in [1, m-1]$).

Where different loss functions disagree is how they assign dual weight to the points in $H(A)$.
In particular, each $g \in \mathbb{G}_0$ (and corresponding $f$) defines a notion of entropy via $f^*$. The dual
optimization in Theorem 3.4 can then be interpreted as selecting the max entropy choice (per $f^*$)
amongst those convex combinations of $H(A)$ equal to the origin.

6 Convergence Rates



Convergence rates will be proved for the following family of loss functions. 

Definition 6.1. $\mathbb{G}$ contains all functions $g$ satisfying the following properties. First, $g \in \mathbb{G}_0$. Second,
for any $x \in \mathbb{R}^m$ satisfying $f(x) \le f(A\lambda_0) = m g(0)$, and for any coordinate $(x)_i$, there exist constants
$\eta > 0$ and $\beta > 0$ such that $g''((x)_i) \le \eta g((x)_i)$ and $g((x)_i) \le \beta g'((x)_i)$.

The exponential loss is in this family with $\eta = \beta = 1$ since $\exp(\cdot)$ is a fixed point with respect
to the differentiation operator. Furthermore, as is verified in Remark G.1, the logistic loss is also
in this family, with $\eta = 2^m/(m \ln(2))$ and $\beta = 1 + 2^m$ (which may be loose). In a sense, $\eta$ and $\beta$
encode how similar some $g \in \mathbb{G}$ is to the exponential loss, and thus these parameters can degrade
radically. However, outside the weak learnability case, the other terms in the bounds here can also
incur a large penalty with the exponential loss, and there is some evidence that this is unavoidable
(see the lower bounds in Mukherjee et al. (2011) or the upper bounds in Rätsch et al. (2001)).
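
The two inequalities of Definition 6.1 can be spot-checked on a grid. The following is a minimal sketch, assuming $\lambda_0 = 0_n$ (so that the level set constraint reads $f(x) \le m g(0)$, which forces $g((x)_i) \le m g(0)$ for every coordinate since $g \ge 0$); the helper names, the grid, and the lower cutoff `x_min` are illustrative choices, not part of the text.

```python
import math


def logistic(x):
    """Return (g, g', g'') for g(x) = ln(1 + exp(x)), evaluated stably."""
    g = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    s = 1.0 / (1.0 + math.exp(-x)) if x >= 0 else math.exp(x) / (1.0 + math.exp(x))
    return g, s, s * (1.0 - s)


def exponential(x):
    """Return (g, g', g'') for g(x) = exp(x); all three coincide."""
    e = math.exp(x)
    return e, e, e


def constants_hold(loss, eta, beta, x_max, x_min=-30.0, points=2000):
    """Check g'' <= eta * g and g <= beta * g' on a grid over [x_min, x_max]."""
    for k in range(points + 1):
        x = x_min + (x_max - x_min) * k / points
        g, g1, g2 = loss(x)
        if g2 > eta * g or g > beta * g1:
            return False
    return True


m = 5  # hypothetical number of examples
# Largest coordinate value compatible with g(x_i) <= m * g(0) for each loss.
exp_ok = constants_hold(exponential, 1.0, 1.0, x_max=math.log(m))
log_ok = constants_hold(
    logistic, 2 ** m / (m * math.log(2)), 1 + 2 ** m, x_max=math.log(2 ** m - 1)
)
```

A grid check of this kind is only evidence, not a proof; the formal verification for the logistic loss is the one referenced in Remark G.1.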

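The first step towards proving convergence rates will be to lower bound the improvement due to
one iteration. As discussed previously, standard techniques for analyzing descent methods provide
such bounds in terms of gradients, however to overcome the difficulty of unattainability in the primal
space, the key will be to convert this into distances in the dual via $\gamma(A, S)$, as in (4.5).

Proposition 6.2. For any $t$, $g \in \mathbb{G}$, $A \in \mathbb{R}^{m \times n}$, and $S \supseteq \{\nabla f(A\lambda_t)\}$ with $\gamma(A, S) > 0$,

$$f(A\lambda_{t+1}) - \bar f_A \le f(A\lambda_t) - \bar f_A - \frac{\gamma(A, S)^2 \, \mathsf{D}^1_{S \cap \mathrm{Ker}(A^\top)}(\nabla f(A\lambda_t))^2}{6\eta f(A\lambda_t)}.$$

Proof. The stopping condition grants $\nabla f(A\lambda_t) \notin \mathrm{Ker}(A^\top)$. Proceeding as in (4.4),

$$\gamma(A, S) = \inf_{\phi \in S \setminus \mathrm{Ker}(A^\top)} \frac{\|A^\top\phi\|_\infty}{\mathsf{D}^1_{S \cap \mathrm{Ker}(A^\top)}(\phi)} \le \frac{\|A^\top\nabla f(A\lambda_t)\|_\infty}{\mathsf{D}^1_{S \cap \mathrm{Ker}(A^\top)}(\nabla f(A\lambda_t))}.$$

Combined with the approximate line search guarantee from Proposition D.6,

$$f(A\lambda_t) - f(A\lambda_{t+1}) \ge \frac{\|A^\top\nabla f(A\lambda_t)\|_\infty^2}{6\eta f(A\lambda_t)} \ge \frac{\gamma(A, S)^2 \, \mathsf{D}^1_{S \cap \mathrm{Ker}(A^\top)}(\nabla f(A\lambda_t))^2}{6\eta f(A\lambda_t)}.$$

Subtracting $\bar f_A$ from both sides and rearranging yields the statement. □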

The task now is to manage the dual distance $\mathsf{D}^1_{S \cap \mathrm{Ker}(A^\top)}(\nabla f(A\lambda_t))$, specifically to produce a
relation to $f(A\lambda_t) - \bar f_A$, the total suboptimality in the preceding iteration; from there, standard
tools in convex optimization will yield convergence rates. Matching the problem structure revealed
in Section 5, first the extremal cases of weak learnability and attainability will be handled, and
only then the general case. The significance of this division is that the extremal cases have rate
$O(\ln(1/\epsilon))$, whereas the general case has rate $O(1/\epsilon)$ (with a matching lower bound provided for
the logistic loss). The reason, which will be elaborated in further sections, is straightforward: the
extremal cases are fast for essentially opposing reasons, and this conflict will degrade the rate in the
general case.

6.1 Weak Learnability 

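Theorem 6.3. Suppose $|H(A)| = 0$ and $g \in \mathbb{G}$; then $\gamma(A, \mathbb{R}^m_+) > 0$, and for any $t \ge 0$,

$$f(A\lambda_t) \le f(A\lambda_0) \left( 1 - \frac{\gamma(A, \mathbb{R}^m_+)^2}{6\beta^2\eta} \right)^t.$$

Proof. By Theorem 5.2, $\mathbb{R}^m_+ \cap \mathrm{Ker}(A^\top) = \Phi_A = \{0_m\}$, meaning

$$\mathsf{D}^1_{\mathbb{R}^m_+ \cap \mathrm{Ker}(A^\top)}(\nabla f(A\lambda_t)) = \inf_{\psi \in \Phi_A} \|\nabla f(A\lambda_t) - \psi\|_1 = \|\nabla f(A\lambda_t)\|_1 \ge f(A\lambda_t)/\beta.$$

Next, $\mathbb{R}^m_+$ is polyhedral, and Theorem 5.2 grants $\mathbb{R}^m_+ \cap \mathrm{Ker}(A^\top) \ne \emptyset$ and $\mathbb{R}^m_+ \setminus \mathrm{Ker}(A^\top) \ne \emptyset$, so
Theorem 4.6 provides $\gamma(A, \mathbb{R}^m_+) > 0$. Since $\nabla f(A\lambda_t) \in \mathbb{R}^m_+$, all conditions of Proposition 6.2 are
met, and using $\bar f_A = 0$ (again by Theorem 5.2),

$$f(A\lambda_{t+1}) \le f(A\lambda_t) - \frac{\gamma(A, \mathbb{R}^m_+)^2 f(A\lambda_t)^2}{6\beta^2\eta f(A\lambda_t)} = f(A\lambda_t) \left( 1 - \frac{\gamma(A, \mathbb{R}^m_+)^2}{6\beta^2\eta} \right), \qquad (6.4)$$

and recursively applying this inequality yields the result. □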



As discussed in Section 4, $\gamma(A, \mathbb{R}^m_+) = \gamma$, the latter quantity being the classical weak learning
rate.

Specializing this analysis to the exponential loss (where $\eta = \beta = 1$), the bound becomes $(1 -
\gamma^2/6)^t$, which recovers the bound of Schapire and Singer (1999), although with vastly different
analysis. (The exact expression has denominator 2 rather than 6, which can be recovered with the
closed form line search; cf. Appendix D.)



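In general, solving for $t$ in the expression

$$\frac{f(A\lambda_t) - \bar f_A}{f(A\lambda_0) - \bar f_A} \le \left( 1 - \frac{\gamma(A, \mathbb{R}^m_+)^2}{6\beta^2\eta} \right)^t \le \exp\left( -\frac{t \, \gamma(A, \mathbb{R}^m_+)^2}{6\beta^2\eta} \right) \le \epsilon$$

reveals that $t \le \frac{6\beta^2\eta}{\gamma(A, \mathbb{R}^m_+)^2} \ln(1/\epsilon)$ iterations suffice to reach suboptimality $\epsilon$. Recall that $\beta$ and $\eta$, in
the case of the logistic loss, have only been bounded by quantities like $2^m$. While it is unclear if
this analysis of $\beta$ and $\eta$ was tight, note that it is plausible that the logistic loss is slower than the
exponential loss in this scenario, as it works less in initial phases to correct minor margin violations.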
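
As a worked instance of this rearrangement, the following minimal sketch computes the sufficient iteration count from the bound above. The function name and the sample values of $\gamma$, $\eta$, $\beta$, $\epsilon$ are hypothetical and chosen only for illustration.

```python
import math


def iterations_weak_learnable(gamma, eta, beta, eps):
    """Smallest integer t with exp(-t * gamma^2 / (6 beta^2 eta)) <= eps."""
    rate = gamma ** 2 / (6.0 * beta ** 2 * eta)
    return math.ceil(math.log(1.0 / eps) / rate)


# Exponential loss (eta = beta = 1) with a hypothetical weak learning rate of 0.5.
t = iterations_weak_learnable(gamma=0.5, eta=1.0, beta=1.0, eps=0.01)
# Geometric contraction factor from (6.4) after t iterations.
contraction = (1.0 - 0.5 ** 2 / 6.0) ** t
```

Here `contraction` falls below the target `eps`, consistent with the chain of inequalities: the geometric factor is no larger than the exponential upper bound used to solve for $t$.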
Remark 6.5. The rate $O(\ln(1/\epsilon))$ depended crucially on both $g \le \beta g'$ and $g'' \le \eta g$. If for instance
the second inequality were replaced with $g'' \le C$, then (6.4) would instead have form
$f(A\lambda_{t+1}) \le f(A\lambda_t) - f(A\lambda_t)^2 O(1)$, which by an application of Lemma B.11 would grant a rate
$O(1/\epsilon)$. For functions which asymptote to zero (i.e., everything in $\mathbb{G}_0$), satisfying this milder second
order condition is quite easy. The real mechanism behind producing a fast rate is $g \le \beta g'$, which
guarantees that the flattening of the objective function is concomitant with low objective values.



6.2 Attainability 

Consider now the case of attainability. Recall from Theorem 5.5 and Proposition 5.4 that attainability
occurred along with a stronger property, the 0-coercivity (compact level sets) of $f + \iota_{\mathrm{Im}(A)}$ (it was
not possible to work with $f \circ A$ directly, which will have unbounded level sets when $\mathrm{Ker}(A) \ne \{0_n\}$).

This has an immediate consequence to the task of relating $f(A\lambda_t) - \bar f_A$ to the dual distance
$\mathsf{D}^1_{S \cap \mathrm{Ker}(A^\top)}(\nabla f(A\lambda_t))$: $f$ is a strictly convex function, which means it is strongly convex over any
compact set. Strong convexity in the primal corresponds to upper bounds on second derivatives
(occasionally termed strong smoothness) in the dual, which in turn can be used to relate distance
and objective values. This also provides the choice of polyhedron $S$ in $\gamma(A, S)$: unlike the case of
weak learnability, where the unbounded set $\mathbb{R}^m_+$ was used, a compact set containing the initial level
set will be chosen.

Theorem 6.6. Suppose $|H(A)| = m$ and $g \in \mathbb{G}$. Then there exists a (compact) tightest axis-aligned
rectangle $C$ containing the initial level set $\{x \in \mathbb{R}^m : (f + \iota_{\mathrm{Im}(A)})(x) \le f(A\lambda_0)\}$, and $f$ is strongly
convex with modulus $c > 0$ over $C$. Finally, either $\lambda_0$ is optimal, or $\gamma(A, \nabla f(C)) > 0$, and for all $t$,

$$f(A\lambda_t) - \bar f_A \le \big( f(A\lambda_0) - \bar f_A \big) \left( 1 - \frac{c \, \gamma(A, \nabla f(C))^2}{3\eta f(A\lambda_0)} \right)^t.$$

As in Section 6.1, when $\lambda_0$ is suboptimal, this bound may be rearranged to say that $t \le
\frac{3\eta f(A\lambda_0)}{c \, \gamma(A, \nabla f(C))^2} \ln(1/\epsilon)$ iterations suffice to reach suboptimality $\epsilon$.

To make sense of this bound and its proof, the essential object is $C$, whose properties are captured
in the following lemma, which is stated with some slight generality in order to allow reuse in
Section 6.3.



Lemma 6.7. Let $g \in \mathbb{G}$, $A \in \mathbb{R}^{m \times n}$ with $|H(A)| = m$, and any $d \ge \inf_\lambda f(A\lambda)$ be given. Then there
exists a (compact nonempty) tightest axis-aligned rectangle $C \supseteq \{x \in \mathbb{R}^m : (f + \iota_{\mathrm{Im}(A)})(x) \le d\}$.
Furthermore, the dual image $\nabla f(C) \subset \mathbb{R}^m$ is also a (compact nonempty) axis-aligned rectangle, and
moreover it is strictly contained within $\mathrm{dom}(f^*) \subseteq \mathbb{R}^m_+$. Finally, $\nabla f(C)$ contains dual feasible points
(i.e., $\nabla f(C) \cap \Phi_A \ne \emptyset$).

A full proof may be found in Appendix G.4; the principle is that $|H(A)| = m$ provides 0-coercivity
of $f + \iota_{\mathrm{Im}(A)}$, and thus the initial level set is compact. To later show $\gamma(A, S) > 0$ via Theorem 4.6, $S$
must be polyhedral, and to apply Proposition 6.2, it must contain the dual iterates $\{\nabla f(A\lambda_t)\}_{t=1}^\infty$;
the easiest choice then is to take the bounding box $C$ of the initial level set, and use its dual map
$\nabla f(C)$. To exhibit dual feasible points within $\nabla f(C)$, note that $C$ will contain a primal minimizer,
and optimality conditions grant that $\nabla f(C)$ contains the dual optimum.

With the polyhedron in place, Proposition 6.2 may be applied, so what remains is to control the
dual distance. Again, this result will be stated with some extra generality in order to allow reuse in
Section 6.3.



Lemma 6.8. Let $A \in \mathbb{R}^{m \times n}$, $g \in \mathbb{G}$, and any compact set $S$ with $\nabla f(S) \cap \mathrm{Ker}(A^\top) \ne \emptyset$ be given.
Then $f$ is strongly convex over $S$, and taking $c > 0$ to be the modulus of strong convexity, for any
$x \in S \cap \mathrm{Im}(A)$,

$$f(x) - \bar f_A \le \frac{1}{2c} \inf_{\psi \in \nabla f(S) \cap \mathrm{Ker}(A^\top)} \|\nabla f(x) - \psi\|_1^2.$$

Before presenting the proof, it can be sketched quite easily. Using the Fenchel-Young inequality
(cf. Proposition B.10) and the form of the dual optimization problem (cf. Theorem 3.4), primal
suboptimality can be converted into a Bregman divergence in the dual. If there is strong convexity
in the primal, it allows this Bregman divergence to be converted into a distance via standard tools in
convex optimization (cf. Lemma B.12). Although $f$ lacks strong convexity in general, it is strongly
convex over any compact set.

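Proof of Lemma 6.8. Consider the optimization problem

$$\inf_{x \in S} \; \inf_{\phi \in \mathbb{R}^m, \|\phi\|_2 = 1} \langle \nabla^2 f(x)\phi, \phi \rangle = \inf_{x \in S} \; \inf_{\phi \in \mathbb{R}^m, \|\phi\|_2 = 1} \sum_{i=1}^m g''(x_i)\phi_i^2;$$

since $S$ is compact and $g''$ and $(\cdot)^2$ are continuous, the infimum is attainable. But $g'' > 0$ and
$\phi \ne 0_m$, meaning the infimum $c$ is nonzero, and moreover it is the modulus of strong convexity of $f$
over $S$ (Hiriart-Urruty and Lemaréchal 2001, Theorem B.4.3.1.iii).

Now let any $x \in S \cap \mathrm{Im}(A)$ be given, define $D = \nabla f(S) \subset \mathbb{R}^m$, and for convenience set
$K := \mathrm{Ker}(A^\top)$. Consider the dual element $\mathsf{P}^1_{D \cap K}(\nabla f(x))$ (which exists since $D \cap K \ne \emptyset$); due
to the projection, it is dual feasible, and thus it must follow from Theorem 3.4 that

$$\bar f_A = \sup\{ -f^*(\psi) : \psi \in \Phi_A \} \ge -f^*\big(\mathsf{P}^1_{D \cap K}(\nabla f(x))\big).$$

Furthermore, since $x \in \mathrm{Im}(A)$,

$$\big\langle x, \mathsf{P}^1_{D \cap K}(\nabla f(x)) \big\rangle = 0.$$

Combined with the Fenchel-Young inequality (cf. Proposition B.10) and $x = \nabla f^*(\nabla f(x))$,

$$\begin{aligned}
f(x) - \bar f_A &\le f(x) + f^*\big(\mathsf{P}^1_{D \cap K}(\nabla f(x))\big) \\
&= f^*\big(\mathsf{P}^1_{D \cap K}(\nabla f(x))\big) + \langle \nabla f(x), x \rangle - f^*(\nabla f(x)) \\
&= f^*\big(\mathsf{P}^1_{D \cap K}(\nabla f(x))\big) - f^*(\nabla f(x)) - \big\langle \nabla f^*(\nabla f(x)), \mathsf{P}^1_{D \cap K}(\nabla f(x)) - \nabla f(x) \big\rangle \qquad (6.9) \\
&\le \frac{1}{2c} \big\| \nabla f(x) - \mathsf{P}^1_{D \cap K}(\nabla f(x)) \big\|_2^2, \qquad (6.10)
\end{aligned}$$

where the last step follows by an application of Lemma B.12, noting that both $\nabla f(x)$ and $\mathsf{P}^1_{D \cap K}(\nabla f(x))$
are in $\nabla f(S) = D$, and $f$ is strongly convex with modulus $c$ over $S$. To finish, rewrite $\mathsf{P}$ as an infimum
and use $\|\cdot\|_2 \le \|\cdot\|_1$. □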

The desired result now follows readily. 

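Proof of Theorem 6.6. Invoking Lemma 6.7 with $d = f(A\lambda_0)$ immediately provides a compact tightest
axis-aligned rectangle $C$ containing the initial level set $S := \{x \in \mathbb{R}^m : (f + \iota_{\mathrm{Im}(A)})(x) \le f(A\lambda_0)\}$.
Crucially, since the objective values never increase, $S$ and $C$ contain every iterate $\{A\lambda_t\}_{t=1}^\infty$.
Applying Lemma 6.8 to the set $C$ (by Lemma 6.7, $\nabla f(C) \cap \mathrm{Ker}(A^\top) \ne \emptyset$), then for any $t$,

$$f(A\lambda_t) - \bar f_A \le \frac{1}{2c} \big\| \nabla f(A\lambda_t) - \mathsf{P}^1_{\nabla f(C) \cap \mathrm{Ker}(A^\top)}(\nabla f(A\lambda_t)) \big\|_1^2,$$

where $c > 0$ is the modulus of strong convexity of $f$ over $C$.

Finally, if there are suboptimal iterates, then $\nabla f(C) \supseteq \nabla f(S)$ contains points that are not dual
feasible, meaning $\nabla f(C) \setminus \mathrm{Ker}(A^\top) \ne \emptyset$; since Lemma 6.7 also provided $\nabla f(C) \cap \Phi_A \ne \emptyset$ and $\nabla f(C)$
is a hypercube, it follows by Theorem 4.6 that $\gamma(A, \nabla f(C)) > 0$. Plugging this into Proposition 6.2
and using $f(A\lambda_t) \le f(A\lambda_0)$ gives

$$\begin{aligned}
f(A\lambda_{t+1}) - \bar f_A &\le f(A\lambda_t) - \bar f_A - \frac{\gamma(A, \nabla f(C))^2 \, \mathsf{D}^1_{\nabla f(C) \cap \mathrm{Ker}(A^\top)}(\nabla f(A\lambda_t))^2}{6\eta f(A\lambda_t)} \\
&\le \big( f(A\lambda_t) - \bar f_A \big) \left( 1 - \frac{c \, \gamma(A, \nabla f(C))^2}{3\eta f(A\lambda_0)} \right),
\end{aligned}$$

and the result again follows by recursively applying this inequality. □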



Remark 6.11. The key conditions on $g \in \mathbb{G}$, namely the existence of constants granting $g \le \beta g'$
and $g'' \le \eta g$ within the initial level set, are much more than are needed in this setting. Inspecting
the presented proofs, it entirely suffices that on any compact set in $\mathbb{R}^m$, $f$ has quadratic upper and
lower bounds (equivalently, bounds on the smallest and largest eigenvalues of the Hessian), which
are precisely the weaker conditions used in previous treatments (Bickel et al. 2006; Rätsch et al.
2001).

These quantities are therefore necessary for controlling convergence under weak learnability.
To see how the proofs of this section break down in that setting, consider the central Bregman
divergence expression in (6.9). What is really granted by attainability is that every iterate lies well
within the interior of $\mathrm{dom}(f^*)$, and therefore these Bregman divergences, which depend on $\nabla f^*$,
can not become too wild. On the other hand, with weak learnability, all dual weights go to zero
(cf. Theorem 5.2), which means that $|\nabla g^*| \uparrow \infty$, and thus the upper bound in (6.10) ceases to be
valid. As such, another mechanism is required to control this scenario, which is precisely the role of
$g \le \beta g'$ and $g'' \le \eta g$.

6.3 General Setting



The key development of Section 5.3 was that general instances may be decomposed uniquely into two
smaller pieces, one satisfying attainability and the other satisfying weak learnability, and that these
smaller problems behave somewhat independently. This independence is leveraged here to produce
convergence rates relying upon the existing rate analysis for the attainable and weak learnable
cases. The mechanism of the proof is as straightforward as one could hope for: decompose the dual
distance into the two pieces, handle them separately using preceding results, and then stitch them
back together.



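Theorem 6.12. Suppose $g \in \mathbb{G}$ and $1 \le |H(A)| \le m - 1$. Recall from Section 5.3 the partition of
the rows of $A$ into $A_0 \in \mathbb{R}^{m_0 \times n}$ and $A_+ \in \mathbb{R}^{m_+ \times n}$, and suppose the axes of $\mathbb{R}^m$ are ordered so that
$A = \begin{bmatrix} A_0 \\ A_+ \end{bmatrix}$. Set $C_+$ to be the tightest axis-aligned rectangle $C_+ \supseteq \{x \in \mathbb{R}^{m_+} : (f + \iota_{\mathrm{Im}(A_+)})(x) \le
f(A\lambda_0)\}$, and $w := \sup_t \|\nabla f(A_+\lambda_t) - \mathsf{P}^1_{\nabla f(C_+) \cap \mathrm{Ker}(A_+^\top)}(\nabla f(A_+\lambda_t))\|_1$. Then $C_+$ is compact, $w < \infty$,
$f$ has modulus of strong convexity $c > 0$ over $C_+$, and $\gamma(A, \mathbb{R}^{m_0}_+ \times \nabla f(C_+)) > 0$. Using these terms,
for all $t$,

$$f(A\lambda_t) - \bar f_A \le \frac{2 f(A\lambda_0)}{(t + 1) \min\left\{ 1, \; \gamma(A, \mathbb{R}^{m_0}_+ \times \nabla f(C_+))^2 / \big(3\eta(\beta + w/(2c))^2\big) \right\}}.$$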



The new term, $w$, appears when stitching together the two subproblems. For choices of $g \in \mathbb{G}$
where $\mathrm{dom}(g^*)$ is a compact set, this value is easy to bound; for instance, the logistic loss, where
$\mathrm{dom}(g^*) = [0, 1]$, has $w \le \sup_{\phi \in \mathrm{dom}(f^*)} \|\phi - 0_m\|_1 = m$ (since $0_m \in \mathrm{dom}(f^*)$). And with the
exponential loss, taking $S := \{\lambda \in \mathbb{R}^n : f(A\lambda) \le f(A\lambda_0)\}$ to denote the initial level set, since $0_m$ is
always dual feasible,

$$w \le \sup_{\lambda \in S} \|\nabla f(A\lambda)\|_1 = \sup_{\lambda \in S} f(A\lambda) = f(A\lambda_0) = m.$$

Note that rearranging the rate from Theorem 6.12 will provide that $O(1/\epsilon)$ iterations suffice to
reach suboptimality $\epsilon$, whereas the earlier scenarios needed only $O(\ln(1/\epsilon))$ iterations. The exact
location of the degradation will be pinpointed after the proof, and is related to the introduction of
$w$.


Proof of Theorem \6.1S\ By Theorem 
/(A+Af), thus 



5.9 



/a+ = fA, and the form of / gives f{AXt) ^ f{AoXt) + 

(6.13) 



fiAXt) -fA^ fiA^Xt) + f{A+Xt) - fA+. 
For the left term, since g{x) < f3\g'{x)\, 

fiAoXt) < ,9||V/(AoA0||i = mHAoXt) - PLo(W(AoAO)||i 
which used the fact (from Theorem |5.9[) that ^Ao = {Omo}- 



(6.14) 



For the right term of (6.13), recall from Theorem 5.9 that / + iiin(yi+) is 0-coercive, thus the 



level set S+ := {x £ M™+ : (/ + tim(A+))(2;) < f{AXo)} is compact. For all t, since / > and the 
objective values never increase, 

f{AXo) > f{AXt) = f{A^Xt) + f{A+Xt) > f{A+Xt)- 

in particular, A^Xt G 5*+. It is crucial that the level set compares against /(^Aq) and not /(A+Aq). 



Continuing, Lemma 6.7 may be applied to with value d = /(^Aq), which grants a tightest 

™+ containing 5+, and moreover V/(C+) H 'Kei{A^) ^ 0. Applying 



axis-aligned rectangle C+ C 
1+ and C+, J 

f{A+Xt) - fA^ < 



Lemma 6.8 to A^ and C+, / is strongly convex with modulus c > over C+, and for any t, 

^ 1V/(A+A0 - P^,(c,)nKcr(AT)(V/(^+A,))||?. 



2c' 



(6.15) 
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Next, set w := sup^ |lV/(^+At) - Pv/(c+)nKcr(AT)(^/(^+-^t))lli; w < co since S+ is compact and 
V f{C+) n Kcr(yl^) is nonempty. By the definition of w, 

Dv/(C+)nKer(AT)(V/(^+AO)' < «^D^/(C,.)nKcr(AT) A*)), 



wliicli combined with (6.15) yields 



To merge tfie subproblem dual distance upper bounds (6.141 and (6.16) via Lemma G.2 



be shown that (K™" x V/(C+)) n $a 7^ 0- But this follows by construction and Theorem 5.9 



(6.16) 

it must 
since 



{0,„} = <^Aa ^ 1^+, V/(C+) n ^Aj, ^ by Lemma 6.7 and the decomposition $^ = <^Ao x *i'A+- 
Returning to the total suboptimality expression ( |6.13[ ), these dual distance bounds yield 

the second step using Lemma [G.2[ 

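To finish, note $\mathbb{R}^{m_0}_+ \times \nabla f(C_+)$ is polyhedral, and

$$(\mathbb{R}^{m_0}_+ \times \nabla f(C_+)) \setminus \mathrm{Ker}(A^\top) \supseteq \{\nabla f(A\lambda_t)\}_{t=1}^{\infty} \setminus \mathrm{Ker}(A^\top) \neq \emptyset,$$

since no primal iterate is optimal and thus $\nabla f(A\lambda_t)$ is not dual feasible by optimality conditions; combined with the above derivation $(\mathbb{R}^{m_0}_+ \times \nabla f(C_+)) \cap \Phi_A \neq \emptyset$, Theorem 4.6 may be applied, meaning $\gamma(A, \mathbb{R}^{m_0}_+ \times \nabla f(C_+)) > 0$. As such, all conditions of Proposition 6.2 are met, and making use of $f(A\lambda_t) \le f(A\lambda_0)$,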



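$$\begin{aligned} f(A\lambda_{t+1}) - \bar f_A &\le f(A\lambda_t) - \bar f_A - \frac{\gamma(A, \mathbb{R}^{m_0}_+ \times \nabla f(C_+))^2\, D^1_{(\mathbb{R}^{m_0}_+ \times \nabla f(C_+)) \cap \mathrm{Ker}(A^\top)}(\nabla f(A\lambda_t))^2}{6\eta f(A\lambda_t)} \\ &\le f(A\lambda_t) - \bar f_A - \frac{\gamma(A, \mathbb{R}^{m_0}_+ \times \nabla f(C_+))^2\, (f(A\lambda_t) - \bar f_A)^2}{6\eta f(A\lambda_0)(\beta + w/(2c))^2}. \end{aligned}$$

Applying Lemma B.11 with

$$\epsilon_t := \frac{f(A\lambda_t) - \bar f_A}{f(A\lambda_0)} \qquad\text{and}\qquad r := \frac{1}{2}\min\left\{1, \frac{\gamma(A, \mathbb{R}^{m_0}_+ \times \nabla f(C_+))^2}{3\eta(\beta + w/(2c))^2}\right\}$$

gives the result. □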



In order to produce a rate $O(\ln(1/\epsilon))$ under attainability, strong convexity related the suboptimality to a squared dual distance $\|\cdot\|_1^2$ (cf. (6.10)). On the other hand, the rate $O(\ln(1/\epsilon))$ under weak learnability came from a fortuitous cancellation with the denominator $f(A\lambda_t)$ (cf. (6.4)), which is equal to the total suboptimality since Theorem 5.2 provides $\bar f_A = 0$. But in order to merge the subproblem dual distances via Lemma G.2, the differing properties granting fast rates must be ignored. (In the case of attainability, this process introduces $w$.)

This incompatibility is not merely an artifact of the analysis. Intuitively, the finite and infinite margins sought by the two pieces $A_0, A_+$ are in conflict. For a beautifully simple, concrete case of this, consider the following matrix, due to Schapire (2010):

$$S := \begin{bmatrix} -1 & +1 \\ +1 & -1 \\ -1 & -1 \end{bmatrix}.$$

The optimal solution here is to push both coordinates of $\lambda$ unboundedly positive, with margins approaching $(0, 0, \infty)$. But pushing any coordinate $(\lambda)_i$ too quickly will increase the objective value, rather than decreasing it. In fact, this instance will provide a lower bound, and the mechanism of the proof shows that the primal weights grow extremely slowly, as $O(\ln(t))$.

Theorem 6.17. Fix $g = \ln(1 + \exp(\cdot)) \in \mathbb{G}$, the logistic loss, and suppose the line search is exact. Then for any $t \ge 1$, $f(S\lambda_t) - \bar f_S \ge 1/(8t)$.

(The proof, in Appendix G.6, is by brute force.)
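The lower bound can be probed numerically. The following is a minimal sketch, not the manuscript's proof: it assumes the iteration starts at $\lambda_0 = 0$, picks the coordinate with the largest absolute partial derivative, and approximates the exact line search by bisection on the directional derivative. The helper names (`softplus`, `exact_step`, `run`) are illustrative, and the value $\bar f_S = 2\ln 2$ is the limit of the logistic risk as the margins approach $(0, 0, \infty)$.

```python
import math

# Rows of the 3x2 matrix S (Schapire 2010), as used in Section 6.3.
S = [(-1.0, 1.0), (1.0, -1.0), (-1.0, -1.0)]
F_BAR = 2.0 * math.log(2.0)  # limit of the logistic risk as margins -> (0, 0, inf)


def softplus(z):
    """Numerically stable ln(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def risk(lam):
    """Logistic risk f(S lam)."""
    return sum(softplus(a * lam[0] + b * lam[1]) for a, b in S)


def grad(lam):
    """Gradient of lam -> f(S lam), i.e. S^T grad f(S lam)."""
    g = [0.0, 0.0]
    for row in S:
        w = sigmoid(row[0] * lam[0] + row[1] * lam[1])
        g[0] += row[0] * w
        g[1] += row[1] * w
    return g


def exact_step(lam, j, sign):
    """Approximate exact line search along sign * e_j by bisection on the slope."""
    def slope(a):
        trial = list(lam)
        trial[j] += sign * a
        return sign * grad(trial)[j]

    hi = 1.0
    while slope(hi) < 0.0:      # bracket the minimizer
        hi *= 2.0
    lo = 0.0
    for _ in range(200):        # the slope is monotone by convexity
        mid = 0.5 * (lo + hi)
        if slope(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def run(num_iters):
    """Greedy coordinate descent from lam = 0; returns suboptimality per iteration."""
    lam = [0.0, 0.0]
    gaps = []
    for _ in range(num_iters):
        g = grad(lam)
        j = 0 if abs(g[0]) >= abs(g[1]) else 1
        sign = -1.0 if g[j] > 0 else 1.0
        lam[j] += sign * exact_step(lam, j, sign)
        gaps.append(risk(lam) - F_BAR)
    return gaps


gaps = run(50)
for t in (1, 10, 50):
    print(t, gaps[t - 1], 1.0 / (8 * t))
```

Each printed row shows the iteration $t$, the observed suboptimality, and the bound $1/(8t)$ for comparison.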

Finally, note that this third setting does not always entail slow convergence. Again taking the view of the rows of $S$ being points $\{-S_i\}_{i=1}^{3}$, consider the effect of rotating the entire instance around the origin by $\pi/4$. The optimization scenario is unchanged; however, coordinate descent can now be arbitrarily close to the optimum in one iteration by pushing a single primal weight extremely high.

A Common Notation

Symbol                    Comment
$\mathbb{R}^m$            m-dimensional vector space over the reals.
$\mathbb{R}^m_+$          Non-negative m-dimensional real vectors.
$\mathrm{int}(S)$         The interior of set $S$.
$\mathbb{R}^m_{++}$       Positive m-dimensional real vectors, i.e. $\mathrm{int}(\mathbb{R}^m_+)$.
$0_m, 1_m$                m-dimensional vectors of all zeros and all ones, respectively.
$e_i$                     Indicator vector: 1 at coordinate $i$, 0 elsewhere. Context will provide the ambient dimension.
$\mathrm{Im}(A)$          Image of linear operator $A$.
$\mathrm{Ker}(A)$         Kernel of linear operator $A$.
$\iota_S$                 Indicator function on a set $S$: $\iota_S(x) := 0$ when $x \in S$, and $\iota_S(x) := \infty$ when $x \notin S$.
$\mathrm{dom}(h)$         Domain of convex function $h$, i.e. the set $\{x \in \mathbb{R}^m : h(x) < \infty\}$.
$h^*$                     The Fenchel conjugate of $h$: $h^*(\phi) := \sup_{x \in \mathrm{dom}(h)} \langle \phi, x\rangle - h(x)$. (Cf. Section 3 and Appendix B.2.)
0-coercive                A convex function with all level sets compact is called 0-coercive (cf. Section 5.2).
$\mathbb{G}_0$            Basic loss family under consideration (cf. Section 2).
$\mathbb{G}$              Refined loss family for which convergence rates are established (cf. Section 6).
$\eta, \beta$             Parameters corresponding to some $g \in \mathbb{G}$ (cf. Section 6).
$\Phi_A$                  The general dual feasibility set: $\Phi_A := \mathrm{Ker}(A^\top) \cap \mathbb{R}^m_+$ (cf. Section 5).
$\gamma(A, S)$            Generalization of classical weak learning rate (cf. Section 4).
$\bar f_A$                The minimal objective value of $f \circ A$: $\bar f_A := \inf_\lambda f(A\lambda)$ (cf. Section 2).
$\psi_A^f$                Dual optimum (cf. Section 5).
$P^p_S$                   $\ell^p$ projection onto closed nonempty convex set $S$, with ties broken in some consistent manner (cf. Section 4).
$D^p_S$                   $\ell^p$ distance to closed nonempty convex set $S$: $D^p_S(\phi) := \|\phi - P^p_S(\phi)\|_p$.

B Supporting Results from Convex Analysis, Optimization, and Linear Programming

B.1 Theorems of the Alternative

Theorems of the alternative consider the interplay between a matrix (or a few matrices) and its transpose; they are typically stated as two alternative scenarios, exactly one of which must hold. These results usually appear in connection with linear programming, where Farkas's lemma is used to certify (or not) the existence of solutions. In the present manuscript, they are used to establish the relationship between $\mathrm{Im}(A)$ and $\mathrm{Ker}(A^\top)$, appearing as the first and fourth clauses of the various characterization theorems in Section 5.

The first such theorem, used in the setting of weak learnability, is perhaps the oldest theorem of alternatives (see Dantzig and Thapa 2003, bibliographic notes, Section 5 of Chapter 2). Interestingly, a streamlined presentation, using a related optimization problem (which can nearly be written as $f \circ A$ from this manuscript), can be found in Borwein and Lewis (2000, Theorem 2.2.6).
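Theorem B.1 (Gordan (Borwein and Lewis 2000, Theorem 2.2.1)). For any $A \in \mathbb{R}^{m \times n}$, exactly one of the following situations holds:

$$\exists \lambda \in \mathbb{R}^n \,.\, A\lambda \in \mathbb{R}^m_{--}; \tag{B.2}$$
$$\exists \phi \in \mathbb{R}^m_+ \setminus \{0_m\} \,.\, A^\top\phi = 0_n. \tag{B.3}$$

(Here $\mathbb{R}^m_{--}$ denotes the vectors with every coordinate negative.)

A geometric interpretation is as follows. Take the rows of $A$ to be $m$ points in $\mathbb{R}^n$. Then there are two possibilities: either there exists an open homogeneous halfspace containing all points, or their convex hull contains the origin.

Next is Stiemke's Theorem of the Alternative, used in connection with attainability.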

Theorem B.4 (Stiemke (Borwein and Lewis 2000, Exercise 2.2.8)). For any $A \in \mathbb{R}^{m \times n}$, exactly one of the following situations holds:

$$\exists \lambda \in \mathbb{R}^n \,.\, A\lambda \in \mathbb{R}^m_{-} \setminus \{0_m\}; \tag{B.5}$$
$$\exists \phi \in \mathbb{R}^m_{++} \,.\, A^\top\phi = 0_n. \tag{B.6}$$

(Here $\mathbb{R}^m_{-}$ denotes the vectors with every coordinate non-positive.)

The geometric interpretation here is that either there exists a closed homogeneous halfspace containing all $m$ points, with at least one point interior to the halfspace, or the relative interior of the convex hull of the points contains the origin (for the connection to relative interiors, see for instance Hiriart-Urruty and Lemarechal (2001, Remark A.2.1.4)).

Finally, a version of Motzkin's Transposition Theorem, which can encode the theorems of alternatives due to Farkas, Stiemke, and Gordan (Ben-Israel 2002).



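Theorem B.7 (Motzkin (Dantzig and Thapa 2003, Theorem 2.16)). For any $B \in \mathbb{R}^{z \times n}$ and $C \in \mathbb{R}^{p \times n}$, exactly one of the following situations holds:

$$\exists \lambda \in \mathbb{R}^n \,.\, B\lambda \in \mathbb{R}^z_{--} \wedge C\lambda \in \mathbb{R}^p_{-}; \tag{B.8}$$
$$\exists \phi_B \in \mathbb{R}^z_+ \setminus \{0_z\}, \phi_C \in \mathbb{R}^p_+ \,.\, B^\top\phi_B + C^\top\phi_C = 0_n. \tag{B.9}$$

For this geometric interpretation, take any matrix $A \in \mathbb{R}^{m \times n}$, broken into two submatrices $B \in \mathbb{R}^{z \times n}$ and $C \in \mathbb{R}^{p \times n}$, with $z + p = m$; again, consider the rows of $A$ as $m$ points in $\mathbb{R}^n$. The first possibility is that there exists a closed homogeneous halfspace containing all $m$ points, the $z$ points corresponding to $B$ being interior to the halfspace. Otherwise, the origin can be written as a convex combination of these $m$ points, with positive weight on at least one element of $B$.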

B.2 Fenchel Conjugacy

The Fenchel conjugate of a function $h$, defined in Section 3, is

$$h^*(\phi) = \sup_{x \in \mathrm{dom}(h)} \langle x, \phi\rangle - h(x),$$

where $\mathrm{dom}(h) = \{x : h(x) < \infty\}$. The main property of the conjugate, indeed what motivated its definition, is that $\nabla h^*(\nabla h(x)) = x$ (Hiriart-Urruty and Lemarechal 2001, Corollary E.1.4.4). To demystify this, differentiate and set to zero the contents of the above sup: the Fenchel conjugate acts as an inverse gradient map. For a beautiful description of Fenchel conjugacy, please see Hiriart-Urruty and Lemarechal (2001, Section E.1.2).

Another crucial property of Fenchel conjugates is the Fenchel-Young inequality, simplified here for differentiability (the "if" can be strengthened to "iff" via subgradients).

Proposition B.10 (Fenchel-Young (Borwein and Lewis 2000, Proposition 3.3.4)). For any convex function $h$ and $x \in \mathrm{dom}(h)$, $\phi \in \mathrm{dom}(h^*)$,

$$h(x) + h^*(\phi) \ge \langle x, \phi\rangle,$$

with equality if $\phi = \nabla h(x)$.
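Both properties can be checked numerically on a concrete loss. The sketch below uses the exponential loss $g = \exp(\cdot)$, whose conjugate has the standard closed form $g^*(\phi) = \phi\ln\phi - \phi$ for $\phi > 0$ (with $g^*(0) = 0$ and $g^*(\phi) = \infty$ for $\phi < 0$); the function names and grid values are illustrative choices, not taken from the manuscript.

```python
import math


def g(x):
    """Exponential loss g(x) = exp(x)."""
    return math.exp(x)


def g_prime(x):
    return math.exp(x)


def g_conj(phi):
    """Conjugate of exp: phi*ln(phi) - phi for phi > 0, 0 at phi = 0, inf below 0."""
    if phi < 0:
        return math.inf
    if phi == 0:
        return 0.0
    return phi * math.log(phi) - phi


def g_conj_prime(phi):
    """Derivative of the conjugate on (0, inf): ln(phi)."""
    return math.log(phi)


def fenchel_young_gap(x, phi):
    """g(x) + g*(phi) - x*phi, which the inequality says is nonnegative."""
    return g(x) + g_conj(phi) - x * phi


xs = [-3.0, -1.0, 0.0, 0.5, 2.0]
phis = [0.0, 0.1, 1.0, 2.5, 10.0]

# Smallest gap over the grid (should be >= 0 up to rounding).
min_gap = min(fenchel_young_gap(x, p) for x in xs for p in phis)
# Inverse gradient map: grad g*(grad g(x)) should return x.
roundtrip_err = max(abs(g_conj_prime(g_prime(x)) - x) for x in xs)
# Equality case phi = grad g(x).
equality_err = max(abs(fenchel_young_gap(x, g_prime(x))) for x in xs)
print(min_gap, roundtrip_err, equality_err)
```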






B.3 Convex Optimization

Two standard results from convex optimization will help produce convergence rates; note that these results can be found in many sources.

First, a lemma to convert single-step convergence results into general convergence results.

Lemma B.11 (Lemma 20 from Shalev-Shwartz and Singer (2008)). Let $1 \ge \epsilon_1 \ge \epsilon_2 \ge \cdots$ be given with $\epsilon_{t+1} \le \epsilon_t - r\epsilon_t^2$ for some $r \in (0, 1/2]$. Then $\epsilon_t \le (r(t+1))^{-1}$.
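As a sanity check on how a single-step recursion turns into a $1/t$ rate, the following sketch iterates the recursion with equality, $\epsilon_{t+1} = \epsilon_t - r\epsilon_t^2$, and compares against the bound $(r(t+1))^{-1}$. The starting value $\epsilon_1 = 1$ and the three values of $r$ are arbitrary illustrative choices.

```python
def equality_sequence(r, eps1, num_steps):
    """Iterate eps_{t+1} = eps_t - r * eps_t**2 (the recursion with equality)."""
    eps = [eps1]
    for _ in range(num_steps - 1):
        eps.append(eps[-1] - r * eps[-1] ** 2)
    return eps


def within_bound(r, eps):
    """Check eps_t <= 1 / (r * (t + 1)) for t = 1, ..., len(eps)."""
    return all(e <= 1.0 / (r * (t + 1)) for t, e in enumerate(eps, start=1))


results = {}
for r in (0.5, 0.1, 0.01):
    eps = equality_sequence(r, eps1=1.0, num_steps=1000)
    results[r] = within_bound(r, eps)
    print(r, eps[-1], 1.0 / (r * 1001), results[r])
```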

Although strong convexity in the primal grants the existence of a lower bounding quadratic, it grants upper bounds in the dual. The following result is also standard in convex analysis, see for instance Hiriart-Urruty and Lemarechal (2001, proof of Theorem E.4.2.2).

Lemma B.12 (Lemma 18 from Shalev-Shwartz and Singer (2008)). Let $h$ be strongly convex over compact convex set $S$ with modulus $c$. Then for any $\phi_1, \phi_1 + \phi_2 \in \nabla h(S)$,

$$h^*(\phi_1 + \phi_2) - h^*(\phi_1) \le \langle \nabla h^*(\phi_1), \phi_2\rangle + \frac{1}{2c}\|\phi_2\|_1^2.$$

C Basic Properties of $g \in \mathbb{G}_0$

Lemma C.1. Let any $g \in \mathbb{G}_0$ be given. Then $g$ is strictly convex, $g > 0$, $g$ strictly increases ($g' > 0$), and $g'$ strictly increases. Lastly, $\lim_{x\to\infty} g(x) = \infty$.

Proof. (Strict convexity and $g'$ strictly increases.) For any $x < y$,

$$g'(y) = g'(x) + \int_x^y g''(t)\,dt \ge g'(x) + (y - x)\inf_{t \in [x,y]} g''(t) > g'(x),$$

thus $g'$ strictly increases, granting strict convexity (Hiriart-Urruty and Lemarechal 2001, Theorem B.4.1.4).

($g$ strictly increases, i.e. $g' > 0$.) Suppose there exists $y$ with $g'(y) \le 0$, and choose any $x < y$. Since $g'$ strictly increases, $g'(x) < 0$. But that means

$$\lim_{z\to-\infty} g(z) \ge \lim_{z\to-\infty} g(x) + (z - x)g'(x) = \infty,$$

a contradiction.

($g > 0$.) If there existed $y$ with $g(y) \le 0$, then the strict increasing property would invalidate $\lim_{x\to-\infty} g(x) = 0$.

($\lim_{x\to\infty} g(x) = \infty$.) Let any sequence $\{c_i\}_{i=1}^{\infty} \uparrow \infty$ be given; the result follows by convexity and $g' > 0$, since

$$\lim_{i\to\infty} g(c_i) \ge \lim_{i\to\infty} g(c_1) + g'(c_1)(c_i - c_1) = \infty. \qquad □$$

Next, a deferred proof regarding properties of $g^*$ for $g \in \mathbb{G}_0$.

Proof of Lemma 3.2. $g^*$ is strictly convex because $g$ is differentiable, and $g^*$ is continuously differentiable on $\mathrm{int}(\mathrm{dom}(g^*))$ because $g$ is strictly convex (Hiriart-Urruty and Lemarechal 2001, Theorems E.4.1.1, E.4.1.2).

Next, when $\phi < 0$: $\lim_{x\to-\infty} g(x) = 0$ grants the existence of $y$ such that for any $x \le y$, $g(x) \le 1$, thus

$$g^*(\phi) = \sup_x \phi x - g(x) \ge \sup_{x \le y} \phi x - 1 = \infty.$$

($g > 0$ precludes the possibility of $\infty - \infty$.)

Take $\phi = 0$; then

$$g^*(0) = \sup_x -g(x) = -\inf_x g(x) = 0.$$

When $\phi = g'(0)$, by the Fenchel-Young inequality (Proposition B.10),

$$g^*(\phi) = g^*(g'(0)) = 0 \cdot g'(0) - g(0) = -g(0).$$

Moreover $\nabla g^*(g'(0)) = 0$ (Hiriart-Urruty and Lemarechal 2001, Corollary E.1.4.3), which combined with strict convexity of $g^*$ means $g'(0)$ minimizes $g^*$. $g^*$ is closed (Hiriart-Urruty and Lemarechal 2001, Theorem E.1.1.2), which combined with the above gives that $\mathrm{dom}(g^*) = [0, \infty)$ or $\mathrm{dom}(g^*) = [0, b]$ for some $b > 0$, and the rest of the form of $g^*$. □

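Finally, properties of the empirical risk function $f$ and its conjugate $f^*$.

Lemma C.2. Let any $g \in \mathbb{G}_0$ be given. Then the corresponding $f$ is strictly convex, twice continuously differentiable, and $\nabla f > 0_m$. Furthermore, $\mathrm{dom}(f^*) = \mathrm{dom}(g^*)^m \subseteq \mathbb{R}^m_+$, $f^*(0_m) = 0$, $f^*$ is strictly convex, $f^*$ is continuously differentiable on the interior of its domain, and finally $f^*(\phi) = \sum_{i=1}^m g^*((\phi)_i)$.

Proof. First,

$$f^*(\phi) = \sup_{x \in \mathbb{R}^m} \langle \phi, x\rangle - f(x) = \sup_{x \in \mathbb{R}^m} \sum_{i=1}^m x_i(\phi)_i - g(x_i) = \sum_{i=1}^m g^*((\phi)_i).$$

Next, strict convexity of $g^*$ (cf. Lemma 3.2) means, for $x \neq y$, $\langle \nabla g^*(x) - \nabla g^*(y), x - y\rangle > 0$ (Hiriart-Urruty and Lemarechal 2001, Theorem E.4.1.4); thus, given $\phi_1, \phi_2 \in \mathrm{int}(\mathrm{dom}(f^*))$ with $\phi_1 \neq \phi_2$, strict convexity of $f^*$ follows from

$$\langle \nabla f^*(\phi_1) - \nabla f^*(\phi_2), \phi_1 - \phi_2\rangle = \sum_{i=1}^m \langle \nabla g^*((\phi_1)_i) - \nabla g^*((\phi_2)_i), (\phi_1)_i - (\phi_2)_i\rangle > 0.$$

The remaining properties follow from properties of $g$ and $g^*$ (cf. Lemma C.1 and Lemma 3.2). □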



D Approximate Line Search

This section provides two approximate line search methods for BOOST: an iterative approach, outlined in Appendix D.1 and analyzed in Appendix D.2, and a closed form choice, outlined in Appendix D.3.

The iterative approach follows standard line search principles from nonlinear optimization (Bertsekas 1999; Nocedal and Wright 2006). It requires no parameters, only the ability to evaluate objective values and their gradients, and as such is perhaps of greater practical interest. Due to this, and the fact that its guarantee is just a constant factor worse than the closed form method, all convergence analysis will use this choice.

The closed form step size is provided for the sake of comparison to other choices from the boosting literature. The drawback, as mentioned above, is the need to know certain parameters, specifically a second derivative bound, which may be loose.

Before proceeding, note briefly that this section is the only place where boundedness of the entries of $A$ is used. Without this assumption, the second derivative upper bounds would contain the term $\max_{i,j} A_{ij}^2$, which in turn would appear in the various convergence rates of Section 6.

D.1 The Wolfe Conditions

Consider any convex differentiable function $h$, a current iterate $x$, and a descent direction $v$ (that is, $\nabla h(x)^\top v < 0$). By convexity, the linearization of $h$ at $x$ in direction $v$, symbolically $h(x) + \alpha\nabla h(x)^\top v$, will lie below the function. But, by continuity, it must be the case that, for any $c_1 \in (0, 1)$, the ray $h(x) + \alpha c_1\nabla h(x)^\top v$, depicted in Figure 8, must lie above $h$ for some small region around $x$; this gives the first Wolfe condition, also known as the Armijo condition (cf. Nocedal and Wright (2006, Equation 3.4) and Bertsekas (1999, Exercise 1.2.16)):

$$h(x + \alpha v) \le h(x) + \alpha c_1 \nabla h(x)^\top v. \tag{D.1}$$

Unfortunately, this rule may grant only very limited decrease in objective value, since $\alpha > 0$ can be chosen arbitrarily small and still satisfy the rule; thus, the second Wolfe condition, also called a curvature condition, which depends on $c_2 \in (c_1, 1)$, forces the step to be farther away:

$$\nabla h(x + \alpha v)^\top v \ge c_2 \nabla h(x)^\top v. \tag{D.2}$$

This requires the new gradient (in direction $v$) to be closer to 0, mimicking first order optimality conditions for the exact line search. Note that the new gradient (in direction $v$) may in fact be positive; this does not affect the analysis.

In the case of boosting, with function $f \circ A$, current iterate $\lambda_t$, direction $v_{t+1} \in \{\pm e_{j_{t+1}}\}$ satisfying $\nabla(f \circ A)(\lambda_t)^\top v_{t+1} = -\|\nabla(f \circ A)(\lambda_t)\|_\infty$, these conditions become

$$(f \circ A)(\lambda_t + \alpha v_{t+1}) \le (f \circ A)(\lambda_t) - \alpha c_1\|\nabla(f \circ A)(\lambda_t)\|_\infty, \tag{D.3}$$
$$\nabla(f \circ A)(\lambda_t + \alpha v_{t+1})^\top v_{t+1} \ge -c_2\|\nabla(f \circ A)(\lambda_t)\|_\infty. \tag{D.4}$$

An algorithm to find a point satisfying these conditions, presented in Figure 7, is simple enough: grow $\alpha$ as quickly as possible, and then bisect backwards for a satisfactory point. As compared with the presentation in Nocedal and Wright (2006, Algorithm 3.5), $\alpha_{\max}$ is searched for rather than provided, and convexity removes the need for interpolation.

Proposition D.5. Given a continuously differentiable convex bounded below function $h$, iterate $x$, and direction $v$, WOLFE terminates with an $\alpha > 0$ satisfying (D.1) and (D.2).

Proof. The bracketing search must terminate: $v$ is a descent direction, so the linearization at $x$ with slope $c_1\nabla h(x)^\top v$ will eventually intersect $h$ (since $h$ is bounded below).

The remainder of this proof is illustrated in Figure 8. Let $\alpha_1$ be the greatest positive real satisfying (D.1); due to convexity, every $\alpha \ge 0$ satisfying this first condition must also satisfy $\alpha \in [0, \alpha_1]$. Crucially, $\alpha_1 \le \alpha_{\max}$.

Next, let $\alpha_2$ be the smallest positive real satisfying (D.2); existence of such a point follows from the existence of points satisfying both Wolfe conditions (Nocedal and Wright 2006, Lemma 3.1). By convexity,

$$\langle \nabla h(x + \alpha v) - \nabla h(x), v\rangle \ge 0,$$

and the directional derivative $\nabla h(x + \alpha v)^\top v$ is nondecreasing in $\alpha$; therefore every $\alpha \ge 0$ satisfying (D.2) must satisfy $\alpha \ge \alpha_2$.

Finally, $\alpha_1 \neq \alpha_2$, since $c_1 < c_2$, meaning

$$\nabla h(x + \alpha_2 v)^\top v = c_2\nabla h(x)^\top v < c_1\nabla h(x)^\top v \le \nabla h(x + \alpha_1 v)^\top v,$$

where the equality holds by (D.2) and the minimality of $\alpha_2$.

Combining these facts, the interval $[\alpha_2, \alpha_1]$ is precisely the set of points which satisfy (D.1) and (D.2). The bisection search maintains the invariants $\alpha_{\min} \le \alpha_2$ and $\alpha_{\max} \ge \alpha_1$, meaning no valid solution is ever thrown out: $[\alpha_2, \alpha_1] \subseteq [\alpha_{\min}, \alpha_{\max}]$. $[\alpha_2, \alpha_1]$ has nonzero width (since $\alpha_1 \neq \alpha_2$), and every bisection step halves the width of $[\alpha_{\min}, \alpha_{\max}]$; thus the procedure terminates. □

Routine WOLFE.

Input: Convex function $h$, iterate $x$, descent direction $v$.

Output: Step size $\alpha > 0$ satisfying (D.1) and (D.2).

D.4.1. Bracketing step.
   (a) Set $\alpha_{\max} := 1$.
   (b) While $\alpha_{\max}$ satisfies (D.1):
       • Set $\alpha_{\max} := 2\alpha_{\max}$.

D.4.2. Bisection step.
   (a) Set $\alpha_{\min} := 0$ and $\alpha := \alpha_{\max}/2$.
   (b) While $\alpha$ does not satisfy (D.1) and (D.2):
       i. If $\alpha$ violates (D.1):
          • Set $\alpha_{\max} := \alpha$.
       ii. Else, $\alpha$ must violate (D.2):
          • Set $\alpha_{\min} := \alpha$.
       iii. Set $\alpha := (\alpha_{\min} + \alpha_{\max})/2$.
   (c) Return $\alpha$.

Figure 7: Bracketing and bisecting search for step size satisfying Wolfe conditions.
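The routine translates almost line for line into code. The following is a minimal sketch under stated assumptions rather than a reference implementation: the function is passed as one-dimensional restrictions `phi(a) = h(x + a v)` and `dphi(a) = grad h(x + a v)^T v`, and the `max_iters` safeguard is an addition for the sketch that does not appear in Figure 7.

```python
def wolfe(phi, dphi, c1=1.0 / 3.0, c2=0.5, max_iters=200):
    """Bracket-then-bisect search for a step satisfying both Wolfe conditions.

    phi(a) = h(x + a * v) and dphi(a) = grad h(x + a * v)^T v, for a convex,
    bounded-below h and a descent direction v (so dphi(0) < 0).
    """
    phi0, slope0 = phi(0.0), dphi(0.0)
    if slope0 >= 0.0:
        raise ValueError("v must be a descent direction")

    def armijo(a):      # condition (D.1)
        return phi(a) <= phi0 + a * c1 * slope0

    def curvature(a):   # condition (D.2)
        return dphi(a) >= c2 * slope0

    # Bracketing step: double a_max until (D.1) fails.
    a_max = 1.0
    while armijo(a_max):
        a_max *= 2.0

    # Bisection step: shrink [a_min, a_max] around the valid interval.
    a_min, a = 0.0, a_max / 2.0
    for _ in range(max_iters):
        if armijo(a) and curvature(a):
            return a
        if not armijo(a):
            a_max = a
        else:
            a_min = a
        a = (a_min + a_max) / 2.0
    raise RuntimeError("bisection did not terminate within max_iters")


# Illustrative one-dimensional example: h(z) = (z - 3)^2 searched from z = 0.
phi = lambda a: (a - 3.0) ** 2
dphi = lambda a: 2.0 * (a - 3.0)
step = wolfe(phi, dphi)
print(step)
```

For this quadratic, with $c_1 = 1/3$ and $c_2 = 1/2$, the set of valid steps works out to the interval $[1.5, 4]$, which is the closed interval described in the proof of Proposition D.5.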




Figure 8: The mechanism behind WOLFE: the set of points satisfying (D.1) and (D.2) is a closed interval, and bisection will find interior points. In this figure, dashed lines denote various relevant slopes.

D.2 Improvement Guaranteed by Wolfe Search

The following proof, adapted from Nocedal and Wright (2006, Lemma 3.1), provides the improvement gained by a single line search step. The usual proof depends on a Lipschitz parameter on the gradient, which is furnished here by $g''(x) \le \eta g(x)$.

Proposition D.6 (Cf. Nocedal and Wright (2006, Lemma 3.1)). Fix any $g \in \mathbb{G}$. If $\alpha_{t+1}$ is chosen by WOLFE applied to function $f \circ A$ at iterate $\lambda_t$ in direction $v_{t+1}$ with $c_1 = 1/3$ and $c_2 = 1/2$, then

$$f(A(\lambda_t + \alpha_{t+1}v_{t+1})) \le f(A\lambda_t) - \frac{\|A^\top\nabla f(A\lambda_t)\|_\infty^2}{6\eta f(A\lambda_t)}.$$

Proof. First note that every $\alpha \in [0, \alpha_{t+1}]$ satisfies

$$f(A(\lambda_t + \alpha v_{t+1})) \le f(A\lambda_t).$$

By the fundamental theorem of calculus,

$$\begin{aligned} (\nabla(f \circ A)(\lambda_t + \alpha_{t+1}v_{t+1}) - \nabla(f \circ A)(\lambda_t))^\top v_{t+1} &= \int_0^{\alpha_{t+1}} \sum_{i=1}^m g''(e_i^\top A(\lambda_t + \alpha v_{t+1}))\,(A_{i j_{t+1}})^2\, d\alpha \\ &\le \alpha_{t+1} \sup_{\alpha \in [0, \alpha_{t+1}]} \sum_{i=1}^m g''(e_i^\top A(\lambda_t + \alpha v_{t+1}))\,(A_{i j_{t+1}})^2 \\ &\le \eta\alpha_{t+1} \sup_{\alpha \in [0, \alpha_{t+1}]} \sum_{i=1}^m g(e_i^\top A(\lambda_t + \alpha v_{t+1})) \\ &\le \eta\alpha_{t+1} f(A\lambda_t), \end{aligned}$$

which used boundedness of the entries in $A$.

The rest of the proof continues as in Nocedal and Wright (2006, Theorem 3.2). Specifically, subtracting $\nabla(f \circ A)(\lambda_t)^\top v_{t+1}$ from both sides of (D.4) yields

$$(\nabla(f \circ A)(\lambda_t + \alpha_{t+1}v_{t+1}) - \nabla(f \circ A)(\lambda_t))^\top v_{t+1} \ge (c_2 - 1)\nabla(f \circ A)(\lambda_t)^\top v_{t+1}.$$

Combining these two gives

$$\alpha_{t+1} \ge \frac{(c_2 - 1)\nabla(f \circ A)(\lambda_t)^\top v_{t+1}}{\eta f(A\lambda_t)} = \frac{(1 - c_2)\|\nabla(f \circ A)(\lambda_t)\|_\infty}{\eta f(A\lambda_t)}.$$

Plugging this into (D.3) yields

$$(f \circ A)(\lambda_t + \alpha_{t+1}v_{t+1}) \le (f \circ A)(\lambda_t) - \frac{c_1(1 - c_2)\|\nabla(f \circ A)(\lambda_t)\|_\infty^2}{\eta f(A\lambda_t)}. \qquad □$$

Note briefly that the simpler iterative strategy of backtracking line search is doomed to require knowledge of the sorts of parameters appearing in the closed form choice.


D.3 Non-iterative Step Selection

The same techniques from the proof of Proposition D.6 can provide a closed form choice of $\alpha_t$. In particular, it follows that for any $\alpha \in \{\alpha \ge 0 : f(A\lambda_t) \ge f(A(\lambda_t + \alpha v_{t+1}))\}$, the objective is upper bounded by the quadratic

$$f(A(\lambda_t + \alpha v_{t+1})) \le f(A\lambda_t) - \alpha\|A^\top\nabla f(A\lambda_t)\|_\infty + \frac{\alpha^2\eta f(A\lambda_t)}{2}.$$

This quadratic is minimized at

$$\alpha' := \frac{\|A^\top\nabla f(A\lambda_t)\|_\infty}{\eta f(A\lambda_t)};$$

moreover, this minimum is attained within the interval above, which in particular implies

$$f(A(\lambda_t + \alpha' v_{t+1})) \le f(A\lambda_t) - \frac{\|A^\top\nabla f(A\lambda_t)\|_\infty^2}{2\eta f(A\lambda_t)}.$$

When $\eta$ is simple and tight, this yields a pleasing expression (for instance, $\eta = 1$ when $g = \exp(\cdot)$). In general, however, $\eta$ might be hard to calculate, or simply very loose, in which case performing a line search like WOLFE is preferable.
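For the exponential loss the closed form step is easy to try out. The sketch below is illustrative only: it assumes $g = \exp(\cdot)$ (so $\eta = 1$), reuses the matrix $S$ of Section 6.3 as the instance (its entries lie in $[-1, +1]$), and compares the objective after each step against the guaranteed value $f(A\lambda_t) - \|A^\top\nabla f(A\lambda_t)\|_\infty^2 / (2\eta f(A\lambda_t))$. The helper names are hypothetical.

```python
import math

# Rows of the matrix S from Section 6.3, used here as the boosting instance A.
A = [(-1.0, 1.0), (1.0, -1.0), (-1.0, -1.0)]
ETA = 1.0  # for g = exp, g'' = g, so eta = 1


def risk(lam):
    """f(A lam) with f(x) = sum_i exp(x_i)."""
    return sum(math.exp(sum(a * l for a, l in zip(row, lam))) for row in A)


def grad(lam):
    """A^T grad f(A lam)."""
    g = [0.0] * len(lam)
    for row in A:
        w = math.exp(sum(a * l for a, l in zip(row, lam)))
        for j, a in enumerate(row):
            g[j] += a * w
    return g


def closed_form_step(lam):
    """One coordinate step with alpha' = ||A^T grad f||_inf / (eta * f(A lam))."""
    g = grad(lam)
    j = max(range(len(g)), key=lambda k: abs(g[k]))
    norm_inf = abs(g[j])
    current = risk(lam)
    alpha = norm_inf / (ETA * current)
    new_lam = list(lam)
    new_lam[j] -= math.copysign(alpha, g[j])  # move against the sign of the partial
    guaranteed = current - norm_inf ** 2 / (2.0 * ETA * current)
    return new_lam, guaranteed


lam = [0.0, 0.0]
history = []
for _ in range(20):
    lam, guaranteed = closed_form_step(lam)
    history.append((risk(lam), guaranteed))
print(history[0])
```

Each entry of `history` pairs the objective actually reached with the value promised by the quadratic bound.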



E Approximate Coordinate Selection

Selecting a coordinate $j_t$ translates into selecting some hypothesis $h_t \in \mathcal{H}$; this is in fact a key strength of boosting, since $A$ need not be written down, and a weak learning oracle can select $h_t \in \mathcal{H}$. But for certain hypothesis classes $\mathcal{H}$, it may be impossible to guarantee $h_t$ is truly the best choice.

Observe how these statements translate into gradient descent. Specifically, the choice $v_{t+1}$ made by boosting satisfies

$$v_{t+1}^\top\nabla(f \circ A)(\lambda_t) = v_{t+1}^\top A^\top\nabla f(A\lambda_t) = -\|A^\top\nabla f(A\lambda_t)\|_\infty.$$

On the other hand, the usual choice $v = -\nabla(f \circ A)(\lambda_t)/\|A^\top\nabla f(A\lambda_t)\|_2$ of gradient descent (steepest descent) grants

$$v^\top\nabla(f \circ A)(\lambda_t) = -\|A^\top\nabla f(A\lambda_t)\|_2;$$

note that this choice of $v$ is potentially a dense vector.
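The contrast between the two direction choices can be made concrete in a few lines. In the sketch below the gradient vector `g` is a hypothetical stand-in for $A^\top\nabla f(A\lambda_t)$, and the function names are illustrative.

```python
import math


def coordinate_direction(g):
    """Signed basis vector v with v^T g = -||g||_inf (the boosting choice)."""
    j = max(range(len(g)), key=lambda k: abs(g[k]))
    v = [0.0] * len(g)
    v[j] = -1.0 if g[j] > 0 else 1.0
    return v


def steepest_direction(g):
    """Normalized negative gradient v = -g / ||g||_2, generally dense."""
    norm2 = math.sqrt(sum(x * x for x in g))
    return [-x / norm2 for x in g]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


g = [0.3, -1.2, 0.5]  # hypothetical value of A^T grad f(A lam_t)
v_coord = coordinate_direction(g)
v_steep = steepest_direction(g)
print(v_coord, dot(v_coord, g))
print(v_steep, dot(v_steep, g))
```

The coordinate direction has a single nonzero entry, which is what lets a weak learning oracle stand in for an explicit matrix $A$.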

Remark E.1. Suppose the relaxed condition that the weak learner need merely have any correlation over the provided distribution; in optimization terms, the returned direction $v$ satisfies

$$v^\top\nabla(f \circ A)(\lambda_t) < 0.$$

This choice is not sufficient to guarantee convergence, let alone any reasonable convergence rate. As an example boosting instance, consider either of the matrices


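$$A_1 := \begin{bmatrix} -1 & +1 & 0 \\ +1 & -1 & 0 \\ -1 & -1 & 0 \\ 0 & 0 & -1 \end{bmatrix}, \qquad A_2 := \begin{bmatrix} -1 & +1 & -1 \\ +1 & -1 & -1 \\ -1 & -1 & -1 \end{bmatrix},$$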



the first of which uses confidence-rated predictors, the second of which is weak learnable; note that both instances embed the matrix $S$ due to Schapire (2010), used for lower bounds in Section 6.3.

For either instance, $e_1, e_2, e_1, e_2, e_1, \ldots$ is a sequence of descent directions. But, for either matrix, to approach optimality, the weight on the third column must go to infinity.
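The failure mode in Remark E.1 can be observed directly on $A_2$. The following sketch makes several assumptions that are not in the remark: the exponential loss is used, the step size along each cycled coordinate is the closed form $|g_j| / f$ in the style of Appendix D.3 with $\eta = 1$, and the variable names are illustrative. Cycling through the first two columns keeps decreasing the objective, yet it stays above 2, whereas weight on the third column alone drives it toward the infimum 0.

```python
import math

# The matrix A_2 as read from Remark E.1 (weak learnable via its third column).
A2 = [(-1.0, 1.0, -1.0), (1.0, -1.0, -1.0), (-1.0, -1.0, -1.0)]


def risk(lam):
    """Exponential-loss risk f(A_2 lam) = sum_i exp((A_2 lam)_i)."""
    return sum(math.exp(sum(a * l for a, l in zip(row, lam))) for row in A2)


def partial(lam, j):
    """j-th coordinate of A_2^T grad f(A_2 lam)."""
    return sum(row[j] * math.exp(sum(a * l for a, l in zip(row, lam))) for row in A2)


lam = [0.0, 0.0, 0.0]
values = [risk(lam)]
for t in range(200):
    j = t % 2                    # cycle e_1, e_2, e_1, ... and never touch column 3
    gj = partial(lam, j)
    lam[j] -= gj / risk(lam)     # assumed step rule: |g_j| / f, against the sign of g_j
    values.append(risk(lam))

third_column_only = risk([0.0, 0.0, 10.0])
print(values[-1], third_column_only)
```

With the third weight fixed at zero, the first two rows contribute $e^{u} + e^{-u} \ge 2$ for $u = \lambda_2 - \lambda_1$, which is why the cycled sequence cannot approach the infimum.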

A first candidate fix is to choose some appropriate $c_0 > 0$, and require

$$v^\top\nabla(f \circ A)(\lambda_t) \le -c_0\|\nabla f(A\lambda_t)\|_1,$$

but note, by Proposition 4.2 and Theorem 5.2, that this is only possible under weak learnability. (Dropping the term $\|\nabla f(A\lambda_t)\|_1$ also fails; suppose $A$ grants a minimizer $\bar\lambda$: plugging this in makes the left hand side exactly zero, and continuity thus grants arbitrarily small values.)

Instead consider requiring the weak learning oracle to return some hypothesis at least a fraction $c_0 \in (0, 1]$ as good as the best weak learner in the class; written in the present framework, the direction $v$ must satisfy

$$v^\top\nabla(f \circ A)(\lambda_t) \le -c_0\|A^\top\nabla f(A\lambda_t)\|_\infty.$$

Inspecting the proof of Proposition 6.2, it follows that this approximate selection would simply introduce the constant $c_0$ in all rates, but would not degrade their asymptotic relationship to suboptimality $\epsilon$.



F Generalizing the Weak Learning Rate

F.1 Choosing a Generalization to $\gamma$

Any generalization $\gamma'$ of $\gamma$ should satisfy the following properties.

• When weak learnability holds, $\gamma' = \gamma$.

• For any boosting instance, $\gamma' \in (0, \infty)$.

• $\gamma'$ provides an expression similar to (4.5), which allows the full gradient to be converted into a notion of suboptimality in the dual.

Taking the form of the classical weak learning rate from (4.1) as a model, the template generalized weak learning rate is

$$\gamma'(A, S, C, D) := \inf_{\phi \in S \setminus C} \frac{\|A^\top\phi\|_\infty}{\inf_{\psi \in S \cap D}\|\phi - \psi\|_1},$$

for some sets $S$, $C$, and $D$ (for instance, the classical weak learning rate uses $S = \mathbb{R}^m_+$ and $C = D = \{0_m\}$). In order to provide an expression similar to (4.5), the domain of the infimum must include every suboptimal dual iterate $\nabla f(A\lambda_t)$.

Any choice $C$ which does not include all of $\mathrm{Ker}(A^\top)$ is immediately problematic: this allows $\phi \in S \cap \mathrm{Ker}(A^\top)$ to be selected, whereby $A^\top\phi = 0_n$ and $\gamma' = 0$. But note that without being careful about $D$, it is still possible to force the value 0.
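Remark F.1. Another generalization is to define

$$\gamma''(A) := \gamma'(A, \mathbb{R}^m_+, \mathrm{Ker}(A^\top), \{\psi_A^f\}) = \inf_{\phi \in \mathbb{R}^m_+ \setminus \mathrm{Ker}(A^\top)} \frac{\|A^\top\phi\|_\infty}{\|\phi - \psi_A^f\|_1}. \tag{F.2}$$

This form agrees with the original $\gamma$ when weak learnability holds, and will lead to a very convenient analog to (4.5).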


Unfortunately, 7" may be zero. Specifically, take the matrix S defined in Section 6.3 due to 
Schapire (20101, where 



= 5'(0) 



Furthermore, for any a e (0, 1), define 
(pa ■— Ci G Im(S'); 

Then 



inf 

J!p\Ker(ST) 



l^'^^llc 



< inf 



\S 



tpa ■= (1 - Oi) 
1 + ■0a)l|oc 



1/2 
1/2 




^Slll "£(0,1) ||0„ 



^^lll 



inf 

qG(0,1 



+ G Ker(S'^). 

L -a J lloo 



1 



0. 







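The natural correction to these worries is to set $C = D = \mathrm{Ker}(A^\top)$. But there is still sensitivity due to $S$.

Remark F.3. Set $A := 1_2$, meaning $\mathrm{Ker}(A^\top) = \{z(1, -1) : z \in \mathbb{R}\}$, and $S := B(1_2, \sqrt{2})$, the ball of radius $\sqrt{2}$ around $1_2$; note that $S \cap \mathrm{Ker}(A^\top) = \{0_2\}$. Consider $\gamma'(A, S, \mathrm{Ker}(A^\top), \mathrm{Ker}(A^\top))$, and the sequence $\{\phi_i\}_{i=1}^{\infty}$ where

$$\phi_i := 1_2 - \frac{1}{\sqrt{i^2 + 1}}\begin{bmatrix} i + 1 \\ i - 1 \end{bmatrix}.$$

Note that $\|\phi_i - 1_2\|_2 = \sqrt{2}$, thus $\phi_i \in S$. Furthermore, $A^\top\phi_i \neq 0$, so $\phi_i \notin S \cap \mathrm{Ker}(A^\top)$. As such,

$$\gamma'(A, S, \mathrm{Ker}(A^\top), \mathrm{Ker}(A^\top)) \le \inf_i \frac{\|A^\top\phi_i\|_\infty}{\|\phi_i - 0_2\|_1} = \inf_i \frac{\left\|1_2^\top\left(1_2\sqrt{i^2+1} - \begin{bmatrix} i+1 \\ i-1 \end{bmatrix}\right)\right\|_\infty}{\left\|1_2\sqrt{i^2+1} - \begin{bmatrix} i+1 \\ i-1 \end{bmatrix}\right\|_1}. \tag{F.4}$$

Using $\sqrt{y} \le (1 + y)/2$, the numerator has upper bound

$$\begin{aligned} \left\|1_2^\top\left(1_2\sqrt{i^2+1} - \begin{bmatrix} i+1 \\ i-1 \end{bmatrix}\right)\right\|_\infty &= \left|2\sqrt{i^2+1} - 2i\right| \\ &= 2i\left(\sqrt{1 + i^{-2}} - 1\right) \\ &\le 2i\left((2 + i^{-2})/2 - 1\right) = i^{-1}. \end{aligned} \tag{F.5}$$

The denominator is

$$\begin{aligned} \left\|1_2\sqrt{i^2+1} - \begin{bmatrix} i+1 \\ i-1 \end{bmatrix}\right\|_1 &= \left|\sqrt{i^2+1} - (i+1)\right| + \left|\sqrt{i^2+1} - (i-1)\right| \\ &= \left((i+1) - \sqrt{i^2+1}\right) + \left(\sqrt{i^2+1} - (i-1)\right) \\ &= 2. \end{aligned}$$

Thus (F.4) is bounded above by $\inf_i (2i)^{-1} = 0$.

The difficulty here was the curvature of $S$, which allowed elements arbitrarily close to $\mathrm{Ker}(A^\top)$ without actually being inside this subspace. This possibility is averted in this manuscript by requiring polyhedrality of $S$. This choice is sufficiently rich to allow the various dual-distance upper bounds of Section 6.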
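The collapse of the ratio in Remark F.3 is easy to tabulate. The sketch below assumes the reading $\phi_i = 1_2 - (i+1, i-1)/\sqrt{i^2+1}$ of the sequence and $A = 1_2$, so that $A^\top\phi$ is simply the sum of the two coordinates and the projection onto $S \cap \mathrm{Ker}(A^\top) = \{0_2\}$ is the origin; the function names are illustrative.

```python
import math


def phi(i):
    """phi_i = 1_2 - (i + 1, i - 1) / sqrt(i^2 + 1), a point on the boundary of S."""
    root = math.sqrt(i * i + 1.0)
    return (1.0 - (i + 1.0) / root, 1.0 - (i - 1.0) / root)


def ratio(i):
    """||A^T phi_i||_inf / ||phi_i - 0_2||_1 for A = 1_2."""
    p = phi(i)
    return abs(p[0] + p[1]) / (abs(p[0]) + abs(p[1]))


for i in (1, 10, 100):
    p = phi(i)
    radius = math.hypot(p[0] - 1.0, p[1] - 1.0)  # distance of phi_i from 1_2
    print(i, ratio(i), 1.0 / (2 * i), radius)
```

Each row shows the ratio sitting below $(2i)^{-1}$ while $\phi_i$ stays at distance $\sqrt{2}$ from $1_2$, which is the curvature effect that polyhedrality rules out.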

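F.2 Proof of Theorem 4.6

The proof of Theorem 4.6 requires a few steps, but the strategy is straightforward. First note that $\gamma(A, S)$ can be rewritten as

$$\begin{aligned} \gamma(A, S) &= \inf_{\phi \in S \setminus \mathrm{Ker}(A^\top)} \frac{\|A^\top\phi\|_\infty}{\|\phi - P^1_{S \cap \mathrm{Ker}(A^\top)}(\phi)\|_1} \\ &= \inf_{\phi \in S \setminus \mathrm{Ker}(A^\top)} \frac{\|A^\top(\phi - P^1_{S \cap \mathrm{Ker}(A^\top)}(\phi))\|_\infty}{\|\phi - P^1_{S \cap \mathrm{Ker}(A^\top)}(\phi)\|_1} \\ &= \inf\left\{\frac{\|A^\top v\|_\infty}{\|v\|_1} : v \in \mathbb{R}^m \setminus \{0_m\}, \exists \phi \in S \,.\, v = \phi - P^1_{S \cap \mathrm{Ker}(A^\top)}(\phi)\right\} \\ &= \inf\left\{\|A^\top v\|_\infty : \|v\|_1 = 1, \exists \phi \in S, \exists c > 0 \,.\, cv = \phi - P^1_{S \cap \mathrm{Ker}(A^\top)}(\phi)\right\}, \end{aligned} \tag{F.6}$$

where the second equivalence used $A^\top P^1_{S \cap \mathrm{Ker}(A^\top)}(\phi) = 0_n$.

In the final form, $v \notin \mathrm{Ker}(A^\top)$, and so $A^\top v \neq 0_n$; that is to say, the infimand is positive for every element of its domain. The difficulty is that the domain of the infimum, written in this way, is not obviously closed; thus one cannot simply assert the infimum is attainable and positive.

The goal then will be to reparameterize the infimum to have a compact domain. For technical convenience, the result will be mainly proved for the $\ell^2$ norm (where projections behave nicely), and norm equivalence will provide the final result.

Lemma F.7. Given $A \in \mathbb{R}^{m \times n}$ and a polyhedron $S \subseteq \mathbb{R}^m$ with $S \cap \mathrm{Ker}(A^\top) \neq \emptyset$ and $S \setminus \mathrm{Ker}(A^\top) \neq \emptyset$,

$$\inf_{\phi \in S \setminus \mathrm{Ker}(A^\top)} \frac{\|A^\top(\phi - P^2_{S \cap \mathrm{Ker}(A^\top)}(\phi))\|_2}{\|\phi - P^2_{S \cap \mathrm{Ker}(A^\top)}(\phi)\|_2} > 0. \tag{F.8}$$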

To produce the desired reparameterization of this infimum, the following characterization of 
polyhedral sets will be used. 

Definition F.9. For any nonempty polyhedral set $S \subseteq \mathbb{R}^m$, let $\mathcal{H}_S$ index a finite (but possibly empty) collection of affine functions $g_\alpha : \mathbb{R}^m \to \mathbb{R}$ so that $S = \bigcap_{\alpha \in \mathcal{H}_S}\{x \in \mathbb{R}^m : g_\alpha(x) \le 0\}$ (with the convention that $S = \mathbb{R}^m$ when $\mathcal{H}_S = \emptyset$). For any $x \in S$, let $\mathcal{I}_S(x)$ denote the active set for $x$: $\alpha \in \mathcal{I}_S(x)$ iff $g_\alpha(x) = 0$. Lastly, define a relation $\sim_S$ over points in $S$: given $x, y \in S$, $x \sim_S y$ iff $\mathcal{I}_S(x) = \mathcal{I}_S(y)$. Observe that $\sim_S$ is an equivalence relation over points within $S$, and let $\mathcal{C}_S$ be the set of equivalence classes.

The equivalence relation $\sim_S$ thus partitions $S$ into the members of $\mathcal{C}_S$, each of which has a very convenient structure.

Lemma F.10. Let a polyhedral set $S \subseteq \mathbb{R}^m$ be given, and fix a nonempty $F \in \mathcal{C}_S$. Then $F$ is convex, and $F$ is equal to its relative interior (i.e., $F = \mathrm{ri}(F)$). Finally, fixing an arbitrary $z_0 \in F$, the normal cone at any point $z \in F$ is orthogonal to the vector space parallel to the affine hull of $F$ (i.e., $N_F(z) = (\mathrm{aff}(F) - \{z\})^\perp = (\mathrm{aff}(F) - \{z_0\})^\perp$).

Throughout the remainder of this section, normal and tangent cones will be considered at points within a set $F \in \mathcal{C}_S$. As Lemma F.10 establishes, any set $F \in \mathcal{C}_S$ is relatively open ($F = \mathrm{ri}(F)$); however, the required properties of normal and tangent cones, as developed by Hiriart-Urruty and Lemarechal (2001, Sections A.5.2 and A.5.3), suppose closed convex sets. But it is always the case that $\mathrm{ri}(F) = \mathrm{ri}(\mathrm{cl}(F))$ (Hiriart-Urruty and Lemarechal 2001, Proposition A.2.1.8); as such, the normal and tangent cones at the desired relative interior points may just as well be constructed against $\mathrm{cl}(F)$, and thus the aforementioned properties safely hold.

Proof. If $S = \mathbb{R}^m$ (meaning $\mathcal{H}_S$ is empty) or $\dim(F) = 0$ ($F$ is a single point), everything follows directly, thus suppose $S \neq \mathbb{R}^m$, and fix a nonempty $F \in \mathcal{C}_S$ with $\dim(F) > 0$.

Let any $x_0, x_1 \in F$ and $\beta \in [0, 1]$ be given, and define $x_\beta := (1 - \beta)x_0 + \beta x_1$. Since each $g_\alpha$ defining $S$ is affine,

$$g_\alpha(x_\beta) = (1 - \beta)g_\alpha(x_0) + \beta g_\alpha(x_1). \tag{F.11}$$

By construction of $\mathcal{C}_S$, $g_\alpha(x_0) = 0$ iff $g_\alpha(x_1) = 0$ and otherwise both are negative, thus $g_\alpha(x_\beta) = 0$ iff $g_\alpha(x_0) = g_\alpha(x_1) = 0$, meaning $\mathcal{I}_S(x_\beta) = \mathcal{I}_S(x_0) = \mathcal{I}_S(x_1)$, so $x_\beta \in F$ and $F$ is convex.

Now let any $y_0 \in F$ be given; $y_0 \in \mathrm{ri}(F)$ when there exists a $\delta > 0$ so that

$$B(y_0, \delta) \cap \mathrm{aff}(F) \subseteq F \tag{F.12}$$

(Hiriart-Urruty and Lemarechal 2001, Definition A.2.1.1). To this end, first define $\delta$ to be half the distance to the closest hyperplane defining $S$ which is not active for $y_0$:

$$\delta := \frac{1}{2}\min_{\alpha \in \mathcal{H}_S \setminus \mathcal{I}_S(y_0)} \min\{\|y' - y_0\|_2 : y' \in \mathbb{R}^m, g_\alpha(y') = 0\}.$$

Since there are only finitely many such hyperplanes, and the distance to each is nonzero, $\delta > 0$. Let any $y_\beta \in B(y_0, \delta) \cap \mathrm{aff}(F)$ be given; by definition of $\mathrm{aff}(F)$, there must exist $\beta \in \mathbb{R}$ and $y_1 \in F$ so that $y_\beta = (1 - \beta)y_0 + \beta y_1$. By (F.11), for any $\alpha \in \mathcal{I}_S(y_0) = \mathcal{I}_S(y_1)$,

$$g_\alpha(y_\beta) = (1 - \beta)g_\alpha(y_0) + \beta g_\alpha(y_1) = 0.$$

On the other hand, for any $\alpha \in \mathcal{H}_S \setminus \mathcal{I}_S(y_0)$, it must be the case that $g_\alpha(y_\beta) < 0$, since $y_\beta \in B(y_0, \delta)$, and due to the choice of $\delta$. Returning to the definition of relative interior in (F.12), it follows that $y_0 \in \mathrm{ri}(F)$, and $\mathrm{ri}(F) = F$ since $y_0 \in F$ was arbitrary.

For the final property, for any $z_0, z \in \mathrm{ri}(F) = F$, the tangent cone $T_F(z)$ has form $(\mathrm{aff}(F) - \{z\})$ (see Hiriart-Urruty and Lemarechal 2001, Proposition A.5.2.1 and discussion within Section A.5.3), and note $\mathrm{aff}(F) - \{z\} = \mathrm{aff}(F) + \{z_0 - z\} - \{z_0\} = \mathrm{aff}(F) - \{z_0\}$. Lastly, $N_F(z) = T_F(z)^\perp$ (Hiriart-Urruty and Lemarechal 2001, Proposition A.5.2.4). □



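The relevance to (F.8) and (F.6) is that projections from polyhedron $S$ onto $S \cap \mathrm{Ker}(A^\top)$ (itself a polyhedron, as is verified in the proof of Lemma F.7) must land on some equivalence class of $\mathcal{C}_{S \cap \mathrm{Ker}(A^\top)}$, and these projections are easily characterized.

Lemma F.13. Let any nonempty polyhedra $S \subseteq \mathbb{R}^m$ and $K \subseteq \mathbb{R}^m$ be given, and fix any nonempty $F \in \mathcal{C}_{S \cap K}$ and $x_F \in F$. Define

$$\begin{aligned} P_F &:= \{c(\phi - P^2_{S \cap K}(\phi)) : c > 0, \phi \in S, P^2_{S \cap K}(\phi) \in F\}, \\ D_F &:= N_F(x_F) \cap \{y - x_F : y \in \mathbb{R}^m, \forall \alpha \in \mathcal{I}_S(x_F) \,.\, g_\alpha(y) \le 0\}, \end{aligned}$$

where $N_F(x_F)$ is the normal cone of $F$ at $x_F$. Then $P_F = D_F$.

Note that the final active set $\mathcal{I}_S(x_F)$ is with respect to $S$, not $S \cap K$.

Proof. ($\subseteq$) Let any $\phi \in S$ with $\psi := P^2_{S \cap K}(\phi) \in F$ be given, where the latter is well-defined since $F$ and hence $S \cap K$ are nonempty. By Lemma F.10, $\psi \in \mathrm{ri}(F)$, and $N_F(\psi) = N_F(x_F)$, meaning $\phi - \psi \in N_F(x_F)$ (Hiriart-Urruty and Lemarechal 2001, Proposition A.5.3.3). Since $\phi \in S$, for any $\alpha \in \mathcal{I}_S(\psi) = \mathcal{I}_S(x_F) \subseteq \mathcal{H}_S$, $g_\alpha(\phi) \le 0$, so

$$\begin{aligned} \phi - \psi &\in \{y \in \mathbb{R}^m : g_\alpha(y) \le 0\} - \{\psi\} \\ &= (\{y \in \mathbb{R}^m : g_\alpha(y) \le 0\} - \{\psi - x_F\}) - \{x_F\} \\ &= \{y \in \mathbb{R}^m : g_\alpha(y) \le 0\} - \{x_F\}, \end{aligned}$$

the final equality following since $g_\alpha(x_F) = g_\alpha(\psi) = 0$ and $g_\alpha$ defines an affine hyperplane, meaning the corresponding affine halfspace is closed under translations by $\psi - x_F$. This holds for all $\alpha \in \mathcal{I}_S(x_F)$, thus $\phi - \psi \in D_F$, and since $D_F$ is a convex cone, for any $c > 0$, $c(\phi - \psi) \in D_F$.

($\supseteq$) Define

$$\delta := \min\{\|x_F - z\|_2 : \alpha \in \mathcal{H}_S \setminus \mathcal{I}_S(x_F), z \in \mathbb{R}^m, g_\alpha(z) = 0\}.$$

For any fixed $\alpha$, this minimum is positive since $g_\alpha(x_F) < 0$, while polyhedrality of $S$ grants that $\alpha$ ranges over a finite set, together meaning $\delta > 0$. Now let any $v \in D_F$ be given, and set $\phi := x_F + \delta v/(2\|v\|_2)$. The form of $D_F$ immediately grants $g_\alpha(\phi) \le 0$ for $\alpha \in \mathcal{I}_S(x_F)$, but notice for $\alpha \in \mathcal{H}_S \setminus \mathcal{I}_S(x_F)$, it still holds that $g_\alpha(\phi) < 0$, since $g_\alpha(x_F) < 0$ and $\|\phi - x_F\|_2 < \delta$. So $v = (2\|v\|_2/\delta)(\phi - P^2_{S \cap K}(\phi))$ where $\phi \in S$ and $P^2_{S \cap K}(\phi) = x_F \in F$, meaning $v \in P_F$. □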



The result now follows by considering all elements of Cgr]KcY{A'^)- 

Ker(A^). Note that K (and hence 5 n is a 



Proof of Lemma \F. 7| For convenience, set K 
polyhedron; indeed, it has the form 

K = Ker(A^) = {0 e M™ : 



A^<P - 0„} 

b<o}r\{(j)£ 



ejA^cP > 0}) . 
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Next, note CsnK has at least one nonempty equivalence class, since Sr\K is nonempty by assumption. 



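Rewriting (F.8) as in (F.6), and fixing an $x_F$ within each nonempty $F \in \mathcal{C}_{S \cap K}$, Lemma F.13 grants

$$\inf\{ \|A^\top v\|_2 : \|v\|_2 = 1,\ \exists c > 0,\ \exists \phi \in S \,.\, \phi - P^2_{S \cap K}(\phi) = cv \}$$
$$= \min_{F \in \mathcal{C}_{S \cap K}} \inf\{ \|A^\top v\|_2 : \|v\|_2 = 1,\ \exists c > 0,\ \exists \phi \in S \,.\, \phi - P^2_{S \cap K}(\phi) = cv,\ P^2_{S \cap K}(\phi) \in F \}$$
$$= \min_{F \in \mathcal{C}_{S \cap K}} \inf\{ \|A^\top v\|_2 : \|v\|_2 = 1,\ v \in N_F(x_F),\ \forall \alpha \in I_S(x_F) \,.\, g_\alpha(x_F + v) \le 0 \}.$$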



Since $S \setminus \mathrm{Ker}(A^\top) \neq \emptyset$ and $S \cap \mathrm{Ker}(A^\top) \neq \emptyset$, at least one infimum has a nonempty domain (for the others, take the convention that their value is $+\infty$). Each infimum with a nonempty domain in this final expression is of a continuous function over a compact set (in fact, a polyhedral cone intersected with the boundary of the unit ball), and thus it has a minimizer $\bar u$, which corresponds to some $c(\phi - P^2_{S \cap K}(\phi)) \notin \mathrm{Ker}(A^\top)$, where $c > 0$. It follows that

$$A^\top \bar u = c A^\top (\phi - P^2_{S \cap K}(\phi)) \neq 0_n,$$

meaning each of these infima is positive. But since $S$ is polyhedral, $\mathcal{C}_S$ has finitely many equivalence classes ($|\mathcal{C}_S| \le 2^{|\mathcal{M}_S|}$), meaning the outer minimum is attained and positive. □

Finally, as mentioned above, the desired result follows by norm equivalence. 



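Proof of Theorem 4.6. For the upper bound, note as in the proof of Lemma F.7 that $S \cap \mathrm{Ker}(A^\top) \neq \emptyset$ and the infimand is positive for every element of the domain, so the infimum is finite. For the lower bound, by Lemma F.7 and norm equivalence,

$$\gamma(A, S) = \inf_{\phi \in S \setminus \mathrm{Ker}(A^\top)} \frac{\|A^\top \phi\|_\infty}{\inf_{\psi \in S \cap \mathrm{Ker}(A^\top)} \|\phi - \psi\|_1} \ge \frac{1}{\sqrt{mn}} \inf_{\phi \in S \setminus \mathrm{Ker}(A^\top)} \frac{\|A^\top \phi\|_2}{\inf_{\psi \in S \cap \mathrm{Ker}(A^\top)} \|\phi - \psi\|_2} > 0.$$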
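The norm-equivalence step in this lower bound rests on two standard inequalities. As a sketch in the notation of the proof, write $v \in \mathbb{R}^n$ for $A^\top \phi$ and $w \in \mathbb{R}^m$ for $\phi - \psi$ (both names are introduced here only for illustration):

```latex
\|v\|_\infty \;\ge\; \frac{1}{\sqrt{n}}\,\|v\|_2,
\qquad
\|w\|_1 \;\le\; \sqrt{m}\,\|w\|_2
\quad\Longrightarrow\quad
\frac{\|v\|_\infty}{\|w\|_1} \;\ge\; \frac{1}{\sqrt{mn}} \cdot \frac{\|v\|_2}{\|w\|_2}.
```

Because the second inequality holds for every $w$, it is preserved when the infimum over $\psi$ is taken in the denominator.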

G Miscellaneous Technical Material 

G.1 The Logistic Loss is within G

Remark G.1. This remark develops bounds on the quantities $\eta, \beta$ for the logistic loss $g = \ln(1 + \exp(\cdot))$. First note that the initial level set $S_0 := \{x \in \mathbb{R}^m : f(x) \le f(A\lambda_0)\}$ is contained within a cube $(-\infty, b]^m$, where $b \le m\ln(2)$; this follows since $f(A\lambda_0) = f(0_m) = m\ln(2)$, whereas $g(m\ln(2)) = \ln(1 + \exp(m\ln(2))) > m\ln(2)$.

For convenience, the analysis will be mainly written with respect to $b \le m\ln(2)$. Let any $x \in (-\infty, b]$ be given, and note $g' = \exp(\cdot)/(1 + \exp(\cdot))$, and $g'' = \exp(\cdot)/(1 + \exp(\cdot))^2$.

To determine $\eta$, note $1 \le 1 + \exp(x) \le 1 + \exp(b)$. Since $\ln$ is concave, it follows for all $z \in [1, 1 + \exp(b)]$ that the secant line through $(1, 0)$ and $(1 + \exp(b), \ln(1 + \exp(b)))$ is a lower bound:

$$\ln(z) \ge \left(\frac{\ln(1 + \exp(b)) - 0}{1 + \exp(b) - 1}\right) z - \frac{\ln(1 + \exp(b)) - 0}{1 + \exp(b) - 1} = \ln(1 + \exp(b)) \exp(-b)(z - 1).$$

As such, for $x \in (-\infty, b]$, $\ln(1 + \exp(x)) \ge \exp(x) \ln(1 + \exp(b)) \exp(-b)$, so

$$\frac{g''(x)}{g(x)} = \frac{\exp(x)}{(1 + \exp(x))^2 \ln(1 + \exp(x))} \le \frac{\exp(b)}{(1 + \exp(x))^2 \ln(1 + \exp(b))} \le \frac{\exp(b)}{\ln(1 + \exp(b))}.$$

Consequently, a sufficient choice is $\eta := \exp(b)/\ln(1 + \exp(b)) \le 2^m/(m\ln(2))$.
For $g(x) \le \beta g'(x)$, using $\ln(x) \le x - 1$,

$$\frac{g(x)}{g'(x)} = \frac{\ln(1 + \exp(x))}{\frac{\exp(x)}{1 + \exp(x)}} \le \frac{\exp(x)}{\frac{\exp(x)}{1 + \exp(x)}} = 1 + \exp(x) \le 1 + \exp(b).$$

That is, it suffices to set $\beta := 1 + \exp(b) \le 1 + 2^m$.
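The two constants of this remark can be sanity-checked numerically. The following is a minimal sketch, not part of the original argument: the value `m = 3`, the truncation of the interval at `-30`, and the sample count are arbitrary assumptions made only for illustration.

```python
import math

def g(x):
    # logistic loss ln(1 + exp(x))
    return math.log1p(math.exp(x))

def g1(x):
    # first derivative exp(x) / (1 + exp(x))
    e = math.exp(x)
    return e / (1.0 + e)

def g2(x):
    # second derivative exp(x) / (1 + exp(x))^2
    e = math.exp(x)
    return e / (1.0 + e) ** 2

m = 3                     # hypothetical number of examples
b = m * math.log(2)       # largest value of b allowed by the remark
eta = math.exp(b) / math.log1p(math.exp(b))
beta = 1.0 + math.exp(b)

# Sample (-inf, b], truncated at -30 for the purposes of this check.
xs = [-30.0 + k * (b + 30.0) / 2000 for k in range(2001)]
worst_eta = max(g2(x) / g(x) for x in xs)    # should not exceed eta
worst_beta = max(g(x) / g1(x) for x in xs)   # should not exceed beta

print(worst_eta, eta, 2 ** m / (m * math.log(2)))
print(worst_beta, beta)
```

On the sampled points, both ratios stay below the constants chosen above, which is what the remark claims for every `x` in the interval.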



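G.2 Proof of Theorem 3.4

Proof of Theorem 3.4. Writing the objective as two Fenchel problems,

$$\bar f_A = \inf_{\lambda} f(A\lambda) + \iota_{\mathbb{R}^n}(\lambda),$$
$$d := \sup_{\phi} -f^*(-\phi) - \iota^*_{\mathbb{R}^n}(A^\top \phi).$$

Since $\mathrm{cont}(f) = \mathbb{R}^m$ (set of points where $f$ is continuous) and $\mathrm{dom}(\iota_{\mathbb{R}^n}) = \mathbb{R}^n$, it follows that $A\,\mathrm{dom}(\iota_{\mathbb{R}^n}) \cap \mathrm{cont}(f) = \mathrm{Im}(A) \neq \emptyset$, thus $d = \bar f_A$ (Borwein and Lewis 2000, Theorem 3.3.5). Moreover, since $\bar f_A \le f(0_m)$ and $d \ge -f^*(0_m) \ge 0$, the optimum is finite, and thus the same theorem grants that it is attainable in the dual.

To complete the dual problem, note for any $\lambda \in \mathbb{R}^n$ that

$$\iota^*_{\mathbb{R}^n}(\lambda) = \sup_{\mu \in \mathbb{R}^n} \langle \lambda, \mu \rangle - \iota_{\mathbb{R}^n}(\mu) = \iota_{\{0_n\}}(\lambda).$$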
From this, the term $-\iota^*_{\mathbb{R}^n}(A^\top \phi)$ allows the search in the dual to be restricted to $\phi \in \mathrm{Ker}(A^\top)$. Next replace $\phi \in \mathrm{Ker}(A^\top)$ with $-\psi \in \mathrm{Ker}(A^\top)$, which combined with $\mathrm{dom}(f^*) \subseteq \mathbb{R}^m_+$ (from Lemma C.2) means it suffices to consider $\psi \in \mathrm{Ker}(A^\top) \cap \mathbb{R}^m_+ = \Phi_A$. (Note that the negation was simply to be able to interpret feasible dual variables as nonnegative measures.) Next, $f^*(\phi) = \sum_i g^*((\phi)_i)$ was proved in Lemma C.2.



Finally, the uniqueness of $\psi_A^f$ was established by Collins et al. (2002, Theorem 1), however a direct argument is as follows by the strict convexity of $f^*$ (cf. Lemma C.2). Specifically, if there were some other optimal $\psi' \neq \psi_A^f$, the point $(\psi_A^f + \psi')/2$ is dual feasible and has strictly larger objective value, a contradiction. □



G.3 Proof of Proposition 5.4 



Proof of Proposition 5.4. It holds in general that 0-coercivity grants attainable minima (cf. Hiriart-Urruty and Lemaréchal (2001, Proposition B.3.2.4) and Borwein and Lewis (2000, Proposition 1.1.3)). Conversely, let $\bar x$ with $h(\bar x) = \inf_x h(x)$ and any direction $d \in \mathbb{R}^m$ with $\|d\|_2 = 1$ be given. To demonstrate 0-coercivity, it suffices to show

$$\lim_{t \to \infty} \frac{h(\bar x + td) - h(\bar x)}{t} > 0$$

(Hiriart-Urruty and Lemaréchal 2001, Proposition B.3.2.4.iii). To this end, first note, for any $t \in \mathbb{R}$, that convexity grants

$$h(\bar x + td) \ge h(\bar x + d) + (t - 1) \langle \nabla h(\bar x + d), d \rangle.$$

By strict monotonicity of gradients (Hiriart-Urruty and Lemaréchal 2001, Section B.4.1.4) and first-order necessary conditions ($\nabla h(\bar x) = 0_m$),

$$\langle \nabla h(\bar x + d), d \rangle = \langle \nabla h(\bar x + d) - \nabla h(\bar x), \bar x + d - \bar x \rangle =: c > 0.$$

Combining these,

$$\lim_{t \to \infty} \frac{h(\bar x + td) - h(\bar x)}{t} \ge \lim_{t \to \infty} \frac{h(\bar x + d) + (t - 1)c - h(\bar x)}{t} = c > 0.$$

□
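The converse direction can be illustrated on a one-dimensional example. The sketch below is an assumption-laden illustration rather than anything from the source: it takes the hypothetical strictly convex function `h(x) = ln(1 + exp(x)) + ln(1 + exp(-x))`, whose minimizer is `x_bar = 0`, and compares the difference quotient with the convexity lower bound used in the proof.

```python
import math

def h(x):
    # hypothetical strictly convex function with an attained minimum at 0
    return math.log1p(math.exp(x)) + math.log1p(math.exp(-x))

def dh(x):
    # derivative of h: sigma(x) - sigma(-x), with sigma the logistic sigmoid
    s = 1.0 / (1.0 + math.exp(-x))
    return s - (1.0 - s)

x_bar, d = 0.0, 1.0          # minimizer and a unit direction
c = dh(x_bar + d) * d        # the constant c from the proof

# Difference quotients (h(x_bar + t d) - h(x_bar)) / t for growing t.
ratios = {t: (h(x_bar + t * d) - h(x_bar)) / t for t in (1.0, 10.0, 100.0, 500.0)}
for t, r in ratios.items():
    lower = (h(x_bar + d) + (t - 1.0) * c - h(x_bar)) / t
    print(t, r, lower)
```

In this example the quotients stay above the lower bound and do not decay to zero, matching the limit inequality established above.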
G.4 Proof of Lemma 6.7

Proof of Lemma 6.7. Since $d > \inf_\lambda f(A\lambda)$, the level set $S_d := \{x \in \mathbb{R}^m : (f + \iota_{\mathrm{Im}(A)})(x) \le d\}$ is nonempty. Since $|H(A)| = m$, Theorem 5.5 provides $f + \iota_{\mathrm{Im}(A)}$ is 0-coercive, meaning $S_d$ is compact.

Now consider the rectangle $C$ defined as a product of intervals $C = \otimes_{i=1}^m [a_i, b_i]$, where

$$a_i := \inf\{x_i : x \in S_d\}, \qquad b_i := \sup\{x_i : x \in S_d\}.$$

By construction, $C \supseteq S_d$, and furthermore any smaller axis-aligned rectangle must violate some infimum or supremum above, and so must fail to include a piece of $S_d$. In particular, the tightest rectangle exists, and it is $C$.

Next, note that $\nabla f(x) = (g'(x_1), g'(x_2), \ldots, g'(x_m))$, thus $\nabla f(C) = \otimes_{i=1}^m g'([a_i, b_i])$ is an axis-aligned rectangle in the dual. Since $g$ is strictly convex and $\mathrm{dom}(g) = \mathbb{R}$, both $g'(a_i)$ and $g'(b_i)$ are within $\mathrm{int}(\mathrm{dom}(g^*))$ (for all $i$), and so $\nabla f(C) \subseteq \mathrm{int}(\mathrm{dom}(f^*))$.

Finally, Proposition 5.4 grants that $f + \iota_{\mathrm{Im}(A)}$ has a minimizer; thus choose any $\bar\lambda \in \mathbb{R}^n$ so that $f(A\bar\lambda) = \inf_\lambda f(A\lambda)$. By optimality conditions of Fenchel problems, $\psi_A^f = \nabla f(A\bar\lambda)$ (cf. the optimality conditions in Borwein and Lewis (2000, Exercise 3.3.9.f), and the proof of Theorem 3.4, where a negation was inserted into the dual to allow dual points to be interpreted as nonnegative measures). But the dual optimum is dual feasible, and $A\bar\lambda \in S_d$, so

$$\nabla f(C) \cap \Phi_A \supseteq \{\nabla f(A\bar\lambda)\} \cap \Phi_A = \{\psi_A^f\} \cap \Phi_A \neq \emptyset.$$

□



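G.5 Splitting Distances along $A_0$, $A_+$

Lemma G.2. Let $A = \begin{bmatrix} A_0 \\ A_+ \end{bmatrix}$ be given as in Theorem 6.12, and let a set $S = S_0 \times S_+$ be given with $S_0 \subseteq \mathbb{R}^{m_0}$ and $S_+ \subseteq \mathbb{R}^{m_+}$ and $S \cap \Phi_A \neq \emptyset$. Then, for any $\phi = \begin{bmatrix} \phi_0 \\ \phi_+ \end{bmatrix}$ with $\phi_0 \in \mathbb{R}^{m_0}$ and $\phi_+ \in \mathbb{R}^{m_+}$, [...]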


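Proof. Recall from Theorem 5.9 that $\Phi_A = \Phi_{A_0} \times \Phi_{A_+}$, thus

$$S \cap \Phi_A = (S_0 \cap \Phi_{A_0}) \times (S_+ \cap \Phi_{A_+}),$$

and $S \cap \Phi_A \neq \emptyset$ grants that $S_0 \cap \Phi_{A_0} \neq \emptyset$ and $S_+ \cap \Phi_{A_+} \neq \emptyset$. Define now the notation $[\cdot]_0 : \mathbb{R}^m \to \mathbb{R}^{m_0}$ and $[\cdot]_+ : \mathbb{R}^m \to \mathbb{R}^{m_+}$, which respectively select the coordinates corresponding to the rows of $A_0$, and the rows of $A_+$.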



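Let $\phi = \begin{bmatrix} \phi_0 \\ \phi_+ \end{bmatrix} \in \mathbb{R}^m$ be given; in the above notation, $\phi_0 = [\phi]_0$ and $\phi_+ = [\phi]_+$. By the above Cartesian product and intersection properties, [...] $\in S \cap \Phi_A$, and so [...]

On the other hand, since $P_{S \cap \Phi_A}(\phi) \in (S_0 \cap \Phi_{A_0}) \times (S_+ \cap \Phi_{A_+})$, [...]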
G.6 Proof of Theorem 6.11

Proof of Theorem 6.11. This proof proceeds in two stages: first the suboptimality gap of any solution with norm at most $B$ is shown to be large, and then it is shown that the norm of the BOOST solution (under logistic loss) grows slowly.

To start, $\mathrm{Ker}(S^\top) = \{z(1, 1, 0) : z \in \mathbb{R}\}$, and $-g^*$ is maximized at $g'(0)$ with value $g(0)$ (cf. Lemma C.2). Thus $\psi_S^f = (g'(0), g'(0), 0)$, and $\bar f_S = -f^*(\psi_S^f) = 2g(0) = 2\ln(2)$.
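The primal and dual values in this step can be compared numerically. The sketch below rests on assumptions that are not stated explicitly in the source: the matrix `S` is read off from the three-term objective appearing later in this proof, and the conjugate of the logistic loss is taken to be the standard binary entropy form on `[0, 1]`.

```python
import math

def g(x):
    # logistic loss ln(1 + exp(x))
    return math.log1p(math.exp(x))

def g_conj(p):
    # conjugate of the logistic loss on [0, 1], using 0 ln 0 = 0
    def xlogx(q):
        return 0.0 if q == 0.0 else q * math.log(q)
    return xlogx(p) + xlogx(1.0 - p)

# Hypothetical matrix S, reconstructed from the objective in this proof.
S = [(-1.0, 1.0), (1.0, -1.0), (-1.0, -1.0)]

def primal(u, v):
    # f(S lambda) for lambda = (u, v)
    return sum(g(a * u + b * v) for a, b in S)

def dual(z):
    # dual objective -f*(psi) at the feasible point psi = z * (1, 1, 0)
    return -(g_conj(z) + g_conj(z) + g_conj(0.0))

# psi = z(1, 1, 0) lies in Ker(S^T): each column of S sums to zero over rows 1 and 2.
assert all(S[0][j] + S[1][j] == 0.0 for j in range(2))

dual_best = max(dual(k / 1000) for k in range(1001))   # grid over z in [0, 1]
primal_far = primal(20.0, 20.0)   # the infimum is approached as u = v grows
print(dual_best, primal_far, 2 * math.log(2))
```

The grid maximum of the dual is attained at `z = 1/2`, that is at `(g'(0), g'(0), 0)`, and both values agree with `2 ln(2)` up to floating point error.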



Next, by calculus, given any $B$,

$$\inf_{\|\lambda\|_1 \le B} f(S\lambda) - \bar f_S = f(S(B/2, B/2)) - 2\ln(2)$$
$$= (2\ln(2) + \ln(1 + \exp(-B))) - 2\ln(2)$$
$$= \ln(1 + \exp(-B)).$$
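The closed form for the norm-bounded infimum can be checked by brute force. This is a rough sketch under stated assumptions: the objective is the three-term logistic sum written out later in this proof, the norm constraint is taken to be the $\ell^1$ ball of radius `B`, and the budget `B = 2` and the grid resolution are arbitrary illustrative choices.

```python
import math

def f_S(u, v):
    # three-term logistic objective f(S lambda), lambda = (u, v)
    return (math.log1p(math.exp(v - u)) + math.log1p(math.exp(u - v))
            + math.log1p(math.exp(-u - v)))

B = 2.0     # hypothetical norm budget
N = 200     # grid resolution per coordinate (arbitrary)
grid = [B * (2 * i - N) / N for i in range(N + 1)]   # points in [-B, B]

# Minimize over grid points inside the l1 ball of radius B.
best = min(f_S(u, v) for u in grid for v in grid if abs(u) + abs(v) <= B)
closed_form = 2 * math.log(2) + math.log1p(math.exp(-B))
print(best, closed_form)
```

The grid contains the point `(B/2, B/2)`, so the brute-force minimum should coincide with the closed form up to rounding.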

Now to bound the norm of the iterates. By the nature of exact line search, the coordinates of $\lambda$ are updated in alternation (with arbitrary initial choice); thus let $u_t$ denote the value of the coordinate updated in iteration $t$, and $v_t$ be the one which is held fixed. (In particular, $v_t = u_{t-1}$.)

The objective function, written in terms of $(u_t, v_t)$, is

$$\ln(1 + \exp(v_t - u_t)) + \ln(1 + \exp(u_t - v_t)) + \ln(1 + \exp(-u_t - v_t))$$
$$= \ln\left(2 + \exp(v_t - u_t) + \exp(u_t - v_t) + 2\exp(-u_t - v_t) + \exp(-2u_t) + \exp(-2v_t)\right).$$

Due to the use of exact line search, and the fact that $u_t$ is the new value of the updated variable, the derivative with respect to $u_t$ of the above expression must equal zero. In particular, producing this equality and multiplying both sides by the (nonzero) denominator yields

$$-\exp(v_t - u_t) + \exp(u_t - v_t) - 2\exp(-u_t - v_t) - 2\exp(-2u_t) = 0.$$

Multiplying by $\exp(u_t + v_t)$ and rearranging, it follows that, after line search, $u_t$ and $v_t$ must satisfy

$$\exp(2u_t) = \exp(2v_t) + 2\exp(v_t - u_t) + 2. \qquad \text{(G.3)}$$
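Condition (G.3) can be solved numerically for the updated coordinate. The sketch below uses bisection, which is a technique chosen here for illustration and not something the source prescribes; the bracket `[-50, 50]` and the iteration count are assumptions suited to moderate values of `v`.

```python
import math

def objective(u, v):
    # f(S lambda) for lambda = (u, v), logistic loss
    return (math.log1p(math.exp(v - u)) + math.log1p(math.exp(u - v))
            + math.log1p(math.exp(-u - v)))

def line_search(v, lo=-50.0, hi=50.0, iters=200):
    # Solve exp(2u) = exp(2v) + 2 exp(v - u) + 2 for u by bisection.
    # The residual is increasing in u: exp(2u) grows while 2 exp(v - u) shrinks.
    def residual(u):
        return math.exp(2 * u) - math.exp(2 * v) - 2 * math.exp(v - u) - 2
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

u1 = line_search(0.0)   # first update, starting from u_0 = v_0 = 0
print(u1, math.log(2))
```

With `v = 0` the root is `ln(2)`, the value of the first update from a zero initialization, and nudging `u1` in either direction increases the objective, as expected of an exact line search.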

First it will be shown for $t \ge 1$, by induction, that $u_t \ge v_t$. The base case follows by inspection (since $u_0 = v_0 = 0$ and so $u_1 = \ln(2)$). Now the inductive hypothesis grants $u_t \ge v_t$; the case $u_t = v_t$ can be directly handled by (G.3), thus suppose $u_t > v_t$. But previously, it was shown that the optimal bounded choice has both coordinates equal; as such, the current iterate, with coordinates $(u_t, v_t)$, is worse than the iterate $(u_t, u_t)$, and thus the line search will move in a positive direction, giving $u_{t+1} \ge v_{t+1}$.

It will now be shown by induction that, for $t \ge 1$, $u_t \le \frac{1}{2}\ln(4t)$. The base case follows by the direct inspection above. Applying the inductive hypothesis to the update rule above, and recalling $v_{t+1} = u_t$ and that the weights increase (i.e., $u_{t+1} \ge v_{t+1} = u_t$),

$$\exp(2u_{t+1}) = \exp(2u_t) + 2\exp(u_t - u_{t+1}) + 2 \le \exp(2u_t) + 2\exp(u_t - u_t) + 2 \le 4t + 4 = 4(t + 1).$$

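To finish, recall by Taylor expansion that $\ln(1 + q) \ge q - \frac{q^2}{2}$; consequently for $t \ge 1$

$$f(S\lambda_t) - \bar f_S \ge \inf_{\|\lambda\|_1 \le \ln(4t)} f(S\lambda) - \bar f_S \ge \ln\left(1 + \frac{1}{4t}\right) \ge \frac{1}{4t} - \frac{1}{2}\left(\frac{1}{4t}\right)^2 \ge \frac{1}{8t}.$$

□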
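The whole argument can be replayed numerically. The following is a minimal simulation sketch under assumptions: the alternating exact line search is implemented by bisection on (G.3), the starting point is `u_0 = v_0 = 0`, and the horizon of 200 iterations is arbitrary.

```python
import math

def gap(u, v):
    # f(S lambda) - 2 ln(2) for lambda = (u, v), logistic loss
    return (math.log1p(math.exp(v - u)) + math.log1p(math.exp(u - v))
            + math.log1p(math.exp(-u - v)) - 2 * math.log(2))

def line_search(v):
    # Bisection on exp(2u) - exp(2v) - 2 exp(v - u) - 2, increasing in u.
    lo, hi = -50.0, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        r = math.exp(2 * mid) - math.exp(2 * v) - 2 * math.exp(v - mid) - 2
        if r < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

history = []
u = 0.0
for t in range(1, 201):
    v = u                  # the previously updated coordinate is held fixed
    u = line_search(v)     # the other coordinate is updated exactly
    history.append((t, u, v, gap(u, v)))

for t, u_t, v_t, gap_t in history[:3] + history[-1:]:
    print(t, u_t, 0.5 * math.log(4 * t), gap_t, 1 / (8 * t))
```

Each recorded iterate can be compared against the three claims of the proof: `u_t >= v_t`, `u_t <= ln(4t)/2`, and a suboptimality gap of at least `1/(8t)`.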






